ABSTRACT


This thesis addresses a common issue in search applications: typically, users do not know
exactly which terms are used in the document they are searching for. Several attempts have
been made to overcome this issue by augmenting the document model and/or the query. In this
thesis, a probabilistic topic model augments the document model. Probabilistic document
models are formally introduced and inference methods are derived. It is shown how these
models can be used for information retrieval tasks and how a search application can be
implemented. A prototype was implemented, tested, and evaluated on benchmark corpora. The
evaluation provides empirical evidence that probabilistic document models improve retrieval
performance significantly, and shows which preprocessing steps should be applied before the
model is used.







Contents

1 Introduction
  1.1 Background and Problem Description
  1.2 Methodology
  1.3 Results
  1.4 Terminology
  1.5 Recommended Literature

2 Document Modeling
  2.1 Finite Mixture Models
  2.2 Infinite Mixture Models

3 Application to Search
  3.1 Combining Estimates from Different Markov Chains
  3.2 A Keyword-based Language Model
  3.3 Ranking Documents
  3.4 Steps Towards Implementation

4 Implementation
  4.1 Preprocessing
  4.2 Building the Index
  4.3 Implementation of the Search and Ranking Algorithm
  4.4 Maintaining the Index
  4.5 Evaluation

5 Evaluation
  5.1 Document Modeling Experiments
  5.2 Information Retrieval Experiments

6 Conclusions and Recommendations
  6.1 Discussion of Experimental Results
  6.2 Hidden Markov Models for Topic Detection
  6.3 Empirical Priors for HDP

A Inference and Learning Algorithms
  A.1 Inference for LDA
  A.2 Inference and learning for HDP

B Implementation Examples
  B.1 A MEX Function to Compute the PMF for the Number of Mixture Components in a CRP
  B.2 Matlab Code to Train HDP Model
  B.3 Java Methods

Bibliography
Referenced Authors
Initial Distribution List


List of Figures

2.1 The basic LDA model in plate notation
2.2 The fully generative model in plate notation
2.3 The Dirichlet process mixture model in plate notation
2.4 The HDP model in plate notation
2.5 Approximation of P(T = t) by continuous distributions
5.1 Optimal number of topics for the CRANFIELD dataset
5.2 Typical likelihood behavior of a corpus D for a Gibbs sampler
5.3 Mean average precision versus the number of topics
5.4 Precision and recall for different values of







List of Tables

5.1 Sample topics in Wikipedia
5.2 Comparison of LDA and VSM
5.3 Corpus statistics and baselines
5.4 Effects of removing rare types
5.5 Effects of stemming
5.6 Effects of combining several Markov chains
5.7 Optimal number of topics per corpus
5.8 Comparison of a LDA-based model and a HDP-based model
5.9 Improvements over the baseline







Executive Summary 





Problem Description and Proposed Solution 


The Software Hardware Asset Reuse Enterprise (SHARE) “provides a capability for
discovering, accessing, sharing, managing, and sustaining reusable assets for the Navy
Surface Domain’s programs” (Johnson and Blais, 2008). The purpose of this project is to avoid
costly parallel development of system components and subcomponents.

SHARE consists of two physically separated parts, an asset library and a card catalog.
The asset library collects combat systems software and supporting artifacts. Many of the
documents in the asset library are therefore classified. The card catalog is a Web-based
interface with unclassified descriptions of the assets that allows asset search. Besides
search, it provides such functions as account registry, asset submission assistance, and
asset retrieval request. For the scope of this thesis, only the search feature is of interest.

Typical implementations of search applications are based on keywords, and count the 
number of occurrences of query terms in the documents and rank documents accordingly. 
A keyword-based implementation, however, requires the searcher to know the terminology 
used in the document that should be returned by the application. It is conceivable that similar 
components are used in different domains, in which different terminologies are used. This 
issue can be overcome by semantic search methods. 

This thesis focuses on a small facet of semantic document modeling: probabilistic 
topic models. These models group word types in a collection of documents, and the groups 
are referred to as topics. As in the case of sets in fuzzy logic, a word type can belong to more 
than one of these groups. Therefore, for each word type a probability distribution over all 
topics can be defined. 

Similarly, each document can be described by a probability distribution over all topics. 
For the search application, this allows the augmentation of a user query by terms that belong 
to the respective topics. 

In practice, a search application that is based only on the topic model produces a 
feature space that is too coarse. The result is that almost all documents in a collection are 
returned by the application. This issue can be overcome by using the topic model as an aug- 
mentation of a keyword-based algorithm. Both methods combined can improve the retrieval 
performance significantly. 


Explored Models and Implementation 


In this thesis, the focus is on two different Bayesian document models: Latent Dirich- 
let Allocation (LDA), introduced by Blei et al. (2003), and Hierarchical Dirichlet Processes 
(HDP), described by Teh et al. (2006). 

LDA is a parametric Bayesian model that specifies a prior probability distribution
on the topics that are covered in a document. These topics form a latent feature set that
describes a document collection better than just the words in a dictionary. Using this model,
it is possible to use keywords from a query to infer the most likely topics associated with the
query. The next step in the search process is then to find documents that cover these topics.
As mentioned, LDA is a parametric model. The parameter is the dimensionality of the topic
space, or simply the number of topics that should be used in the model. In LDA, this number
cannot be inferred from the data directly.

This issue is overcome by HDP, a nonparametric Bayesian model. In HDP, the number 
of topics is an outcome of the model and not an input parameter. HDP requires greater 
computational effort than LDA, which mitigates the advantage of not having to specify the 
number of topics in advance. 

Both document models were implemented separately and combined with a simple 
keyword-based model. Implementation for LDA is entirely written in Java, using publicly 
available libraries. The model for HDP is written in Matlab and the results are imported into 
a Java-based application. 

In addition, a search engine that can use either one of the introduced models was
created. In order to compare results from different models and parameter settings, an
evaluation program was also written. The whole application collection is available as a Java
library, which allows the implementation and testing of search applications. Currently, all
applications are command-line-based, but it is an easy task to add a graphical user interface
or attach the applications to a web server (e.g., Apache Jakarta Tomcat¹).


Experimental Results and Recommendations 


In the experiments, publicly available benchmark collections that are also used in 
the information retrieval literature were used. The results show significant improvements 
over the baseline. In addition to the comparison of the probabilistic topic model effects, the 
preprocessing steps necessary to prepare the documents before they can be processed in a 
document model for information retrieval were examined. 

The examined preprocessing steps are stemming and the removal of rare types.
Stemming reduces words in a document to a common stem (e.g., “running” becomes “run”). Rare
types are words that are used fewer than five times throughout the collection. Since LDA and
HDP try to discover correlations between words, such rare types can reduce the quality of the
probabilistic topic model. In the experiments, a positive effect of stemming was shown
empirically. Removing rare types, however, slightly hurt the retrieval performance. The reason
is that the topic model augments a keyword-based model, which is improved by rare types,
because rare types make documents more distinguishable.
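
The rare-type filter described above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the function name `remove_rare_types`, the toy
documents, and the representation of a document as a list of tokens are hypothetical and
are not taken from the prototype; only the threshold of five occurrences comes from the text.

```python
from collections import Counter


def remove_rare_types(documents, min_count=5):
    """Drop every word type that occurs fewer than min_count times in the collection.

    documents: list of documents, each a list of word tokens (assumed representation).
    """
    # Count occurrences of each type over the whole collection, not per document.
    counts = Counter(token for doc in documents for token in doc)
    return [[token for token in doc if counts[token] >= min_count] for doc in documents]


# Toy collection (hypothetical): "radar" occurs 5 times, "track" twice, "sonar" once.
docs = [["radar", "track", "radar"], ["radar", "sonar"], ["radar", "radar", "track"]]
filtered = remove_rare_types(docs, min_count=5)
```

In this toy collection only the type "radar" survives the filter, which also illustrates why
the filter can hurt a keyword-based ranker: the discarded types were the ones that
distinguished the documents from each other.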

The experimental results suggest that probabilistic topic models should be imple- 
mented in the SHARE search application or other search applications for specialized do- 
mains. The implemented search application works fast enough to be used in online ad hoc 
retrieval tasks. 





¹The Apache Software Foundation. Apache Tomcat. http://tomcat.apache.org/. Online, last
accessed 26 August 2009
CHAPTER 1:
Introduction 





1.1 Background and Problem Description 


The results of this thesis are intended to be applicable to the Software Hardware Asset
Reuse Enterprise (SHARE) database, a database that contains requirements documents
describing systems and components for Navy systems development. Having an effective and
efficient search application for the database can contribute to avoiding expensive and risky
duplicate development of components and subcomponents.

The SHARE database “provides a capability for discovering, accessing, sharing, man- 
aging, and sustaining reusable assets for the Navy Surface Domain’s programs” (Johnson and 
Blais, 2008). SHARE consists of two physically separated parts, an asset library and a card 
catalog. The asset library collects combat systems software and supporting artifacts. Many 
of the documents in the asset library are therefore classified. The card catalog is a Web-based 
interface with unclassified descriptions of the assets that allows asset search. Besides search, 
it provides such functions as account registry, asset submission assistance, and asset retrieval 
request. For the scope of this thesis only the search feature is of interest. 

Documents in the card catalog have two properties that suggest improved performance 
of semantic search over keyword search. First, the documents are human-generated free 
text. Second, the documents come from a specialized domain with non-standard terminology 
(Martell et al., 2008). 


1.2 Methodology 


The final goal of this thesis was to implement a prototype of a search engine that 
augments keyword search by a probabilistic topic model. Two probabilistic topic models are 
evaluated in detail: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes 
(HDP). 

In Chapter 2, the respective document models are motivated and formalized. Methods 
for inference and learning are introduced and derived from the model. 

Chapter 3 shows how a document model can be used to augment a standard keyword
search and how documents can be ranked according to their relevance for a query.


In Chapter 4, implementation issues are discussed and important points of the
implementation are illuminated. Chapter 5 describes the conducted experiments and their
results. All experiments are run on standard benchmark corpora that are used frequently in
information retrieval. The corpora are CRANFIELD, CISI, MEDLINE, and TIME MAGAZINE¹. All
these corpora include a collection of documents, a collection of queries, and query-document
relations. This allows the performance of the implemented prototype to be evaluated
directly. The sample documents consist of abstracts and short articles, which makes them
comparable to the SHARE card catalog.

Finally, Chapter 6 discusses the results and points to some directions for future re- 
search. 

Sample code for inference and learning on selected models is provided in Appendix A
of this thesis. It is entirely written in R (R Development Core Team, 2008), which
makes it very readable and easy to follow. Appendix B contains selected algorithms and
implementations from the actual prototype.


1.3 Results 


The experiments shown in Chapter 5 provide evidence that probabilistic topic models 
can improve a keyword search significantly. Additionally, it is shown that stemming (reducing 
word tokens to their respective stem) does not hurt information retrieval performance, while 
it reduces the storage demands of the computed index. 

Retrieval performance is corpus dependent. For CRANFIELD, the maximum mean 
average precision achieved in experiments was 45%, while on CISI only 24% was achieved. 
This confirms findings by other authors who conducted retrieval experiments on the same 
data sets. 

Furthermore, the implemented prototype is functional and can be used either
stand-alone or embedded in a modular search engine as developed by Hawkins (2009). An
LDA-based index can be computed very fast, in a couple of minutes, on a corpus with more than
1,000 abstracts. An HDP-based model requires more time for the same task. However, it does
not require as many fine-tuned model parameters as LDA does, and it had slightly
better retrieval performance in the experiments.





¹All corpora are retrieved from:
http://ir.dcs.gla.ac.uk/resources/test_collections/


1.4 Terminology 


This thesis uses the standard terminology of document modeling and information
retrieval. The basic unit in both fields is the word. This term can have two meanings. First, it
can describe an entry in a dictionary or vocabulary. A vocabulary is an indexed list of words,
which, in cases of ambiguity, will also be referred to as word types or simply types. Second, a
word can denote an observation in a document. In terms of implementation, an observation
is often referred to as a word token or simply token. In terms of document modeling, the
terms word and observation are used interchangeably, as are the terms word and (word) type.
Where the term word is used, the context removes possible ambiguities.
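
The distinction between tokens and types can be made concrete with a short Python sketch.
The helper name `tokens_and_vocabulary` and the whitespace tokenizer are illustrative
assumptions; the prototype's actual tokenization is not described in this section.

```python
def tokens_and_vocabulary(text):
    """Split a text into word tokens and build an indexed vocabulary of word types."""
    # Assumption: lowercase whitespace tokenization is enough for this illustration.
    tokens = text.lower().split()
    # The vocabulary is an indexed list of distinct types.
    vocabulary = sorted(set(tokens))
    type_index = {word: index for index, word in enumerate(vocabulary)}
    return tokens, vocabulary, type_index


tokens, vocabulary, type_index = tokens_and_vocabulary("do not panic do not")
# The text has five tokens (observations) but only three types (vocabulary entries).
token_ids = [type_index[t] for t in tokens]
```

Each token is an observation of one type, so a document can be stored as a sequence of
type indices such as `token_ids`.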


1.5 Recommended Literature 


A general introduction to the field of natural language processing (NLP) is given by
Jurafsky and Martin (2008). The authors emphasize the advantages of statistical models over
entirely rule-based approaches to NLP. It is recommended to additionally look at the errata
page and compare the models in the book with those in the original papers.

Bayesian inference and empirical Bayesian methods are well described by Carlin and
Louis (1996). The authors introduce the concept of Bayesian modeling and inference on 
a very general level before they show specific applications. Different inference methods, in- 
cluding Markov Chain Monte Carlo methods (MCMC) such as Gibbs sampling, are described 
and compared. On top of that, the book contains a high-level introduction to parametric and 
nonparametric Bayesian mixture models (e.g., Dirichlet processes). A broader discussion of 
MCMC is given by Gamerman (1997). The author introduces Bayesian inference methods 
based on MCMC and shows several examples of how they are applied in practice. 

Finite mixture models are studied in a non-Bayesian way by McLachlan and Basford
(1987). The authors describe the history and development of mixture models as well as
many practical applications. Mixtures of normal components are the focus of this book.
Titterington et al. (1985) provide a statistical analysis of finite mixture models with
components from different parametric families and compare Bayesian and non-Bayesian inference
methods. Several tables provide an overview describing the cases in which a particular model
should be applied and which inference methods are suitable.

Latent Dirichlet Allocation (LDA) is formally introduced by Blei et al. (2003). The
authors give an overview of how LDA arises naturally as an extension of finite mixture models
and give methods for inference and parameter estimation applied to document modeling. This
basic model has been extended in different ways that will be discussed in Chapter 2.

Dirichlet processes were formulated by Ferguson (1973). These stochastic processes 
represent a nonparametric Bayesian approach to stochastic modeling. The author proves sev- 
eral properties of Dirichlet processes and also shows how these models can be applied to 
known nonparametric problems. Antoniak (1974) shows how mixtures of Dirichlet processes 
can be formalized and applied in practice. Dirichlet processes are the basis for hierarchical 
Dirichlet processes, which are formally introduced by Teh et al. (2006). Section 2.2 shows 
how these can be applied to document modeling. To get a broader understanding of Dirich- 
let processes and other nonparametric Bayesian methods, the reader is referred to Ghosh 
and Ramamoorthi (2003). For a deeper discussion of the related measure-theoretic issues,
Billingsley (1986) is recommended.





CHAPTER 2: 
Document Modeling 





Probabilistic topic models attempt to capture latent structure in documents. Each word 
in a document is assumed to come from a hidden (latent) topic, and probabilistic topic models 
assign each word to the proper topic. These latent topic assignments produce document 
models that have a high likelihood to generate a given corpus. For the information retrieval 
task, however, these document models need to prove that they indeed lead to better retrieval 
performance. This will be discussed in Chapter 5. 

In all research that is presented in this thesis, a topic is considered to be a multinomial 
distribution over a vocabulary V. This allows the treatment of the problem of topic discovery 
as a parameter estimation problem. The cognitive notion of a topic is not within the scope of 
this thesis. 

In the following, unless stated otherwise, we assume that words in a document have
the exchangeability property (Carlin and Louis, 1996). That is, the document “do not panic”
is produced with exactly the same probability as “not do panic” or “panic not do.” More
formally:

Definition 2.0.1. An infinite sequence of random variables $X_1, \ldots, X_n, \ldots$ is said to be
exchangeable if for all $n > 1$: $P(X_1, \ldots, X_n) = P(X_{\pi(1)}, \ldots, X_{\pi(n)})$, $\forall \pi \in S(n)$,
in which $S(n)$ is the group of permutations of $1, \ldots, n$.

Note. If $X_1, \ldots, X_n, \ldots$ are i.i.d., they are also exchangeable, while the converse is not true
in general.
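
The note on i.i.d. sequences can be checked numerically for the example document. The
following Python sketch assumes a hypothetical unigram distribution `word_probs`; the
probabilities are made up for illustration only.

```python
from itertools import permutations
from math import prod

# Hypothetical unigram probabilities; words are drawn i.i.d. from this distribution.
word_probs = {"do": 0.5, "not": 0.3, "panic": 0.2}


def unigram_probability(document, probs):
    """Probability of a word sequence when every word is drawn independently."""
    return prod(probs[word] for word in document)


document = ("do", "not", "panic")
# One probability per ordering of the three words (3! = 6 orderings).
ordering_probs = [unigram_probability(p, word_probs) for p in permutations(document)]
```

Because the probability is a product over words and multiplication is commutative, all six
orderings receive the same probability (up to floating-point rounding), which is exactly
the exchangeability property of Definition 2.0.1.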

A simplistic approach to document modeling is provided by Nigam et al. (2000), in
which every document is represented as a mixture of unigrams. That is, every document
covers exactly one topic, whereas different documents can share the same topic. This model
will not be discussed in further detail.


2.1 Finite Mixture Models 


A mixture of unigrams model has the shortcoming that the probability of a word
occurring in a document is not well explained by a single parametric distribution. A mixture
model attempts to fit to the document a model that consists of a mixture of probability
distributions, which are conditioned on a latent variable space.


A very early topic model was introduced by Deerwester et al. (1990) and called
“Latent Semantic Indexing.” For topic discovery, Deerwester et al. use singular value
decomposition (SVD) on the word-document co-occurrence matrix. Although this model has been
applied successfully to information retrieval and language modeling tasks, it does not have
a statistical foundation and can therefore not be seen as a probabilistic topic model. For a
closer discussion of this model and its shortcomings, see Hofmann (1999).

In finite mixture models, each observation $x$ is thought to be distributed with density
$f(x|\theta, \phi) = \sum_{k=1}^{K} \theta_k f_k(x|\phi^{(k)})$, in which $k$ is the index of a mixture component, $\phi^{(k)}$ is its
parameter set, $f_k(\cdot|\phi^{(k)})$ its density function determining the distribution, and $\theta$ a vector of
mixing probabilities that determines the proportions of densities in the mixture density. This
requires $\theta$ to add up to unity, $\sum_{k=1}^{K} \theta_k = 1$. $\theta$ is therefore the parameter of a multinomial
distribution (the mixing distribution) (Carlin and Louis, 1996). A general introduction to
mixture models is given by McLachlan and Basford (1987).
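
As a minimal sketch of the mixture density, the following Python code evaluates
$f(x|\theta, \phi)$ for a mixture of normal components. The choice of normal components, the
function names, and all parameter values are assumptions made for illustration; the topic
models discussed later use multinomial components instead.

```python
from math import exp, isclose, pi, sqrt


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))


def mixture_density(x, theta, components):
    """Finite mixture density: sum over k of theta[k] * f_k(x | phi_k).

    theta: mixing probabilities, required to add up to one.
    components: one (mean, sd) parameter set per component (assumed normal here).
    """
    if len(theta) != len(components) or not isclose(sum(theta), 1.0):
        raise ValueError("theta must hold one probability per component and sum to 1")
    return sum(t * normal_pdf(x, mean, sd) for t, (mean, sd) in zip(theta, components))


theta = [0.3, 0.7]                     # mixing probabilities (hypothetical values)
components = [(0.0, 1.0), (4.0, 2.0)]  # (mean, sd) of each component (hypothetical)
density_at_one = mixture_density(1.0, theta, components)
```

Since every component integrates to one and the mixing probabilities add up to unity, the
mixture is again a proper density.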

Unfortunately, it is not trivial to estimate the number of mixture components $K$ in a
mixture model. In the case of creating topic models, too many components would result in
overfitting, whereas too few components would lead to a few very general topics that cannot
be used for applications, such as search, in a reasonable way. Furthermore, the additional
variability introduced by adding a mixture component can also be achieved by increasing
the variance in one of the components. Griffiths and Steyvers (2004) suggest a greedy
hill-climbing algorithm to maximize $P(\mathrm{corpus}|K)$. This only works if the variability for each
component is known in advance or at least assumed. Furthermore, if the number of
components $K$ is fixed in advance or parameters are shared among all components, this leads to
problems in computing the posterior and predictive distributions (Carlin and Louis, 1996).


2.1.1 Probabilistic Latent Semantic Analysis 


For the special case of a finite mixture model in which $f_k = f$ is the probability mass
function of a multinomial distribution with parameters $\phi^{(z)}$ and topic index $z$, this model
represents “Probabilistic Latent Semantic Indexing” (pLSI), a probabilistic topic model
introduced by Hofmann (1999).

The probability of an observation in pLSI is then

$$P(w_i) = \sum_{j=1}^{T} P(w_i|z_i = j)\, P(z_i = j),$$

in which $z_i$ denotes the topic of the $i$th word token ($w_i$) and $T$ the number of topics. The
generating process in the pLSI model looks as follows:

• pick a document $d$ from the corpus with probability $P(d)$
• draw a topic $z$ from the distribution $\theta^{(d)}$, the distribution over topics in document $d$
• draw a word $w$ from the distribution $\phi^{(z)}$, the distribution over words given topic $z$

The joint probability model for words and documents is then $P(w, d) = P(w|d)P(d)$, in
which $P(w|d) = \sum_{i=1}^{T} P(w|z_i)P(z_i|d) = \sum_{i=1}^{T} \phi_w^{(z_i)} \theta_{z_i}^{(d)}$, where $z_i$ is the latent variable
(Hofmann, 1999). However, pLSI does not specify how the mixture distributions $\theta^{(d)}$ are
generated. Therefore, it cannot be seen as a generative model for new documents.
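
The word probability $P(w|d)$ is a weighted sum over topics and can be computed directly
once the two distributions are available. In the Python sketch below, the two-topic
distributions `phi` and the document mixture `theta_d` are made-up values, and the function
name is hypothetical.

```python
def word_given_document(word, theta_d, phi):
    """P(w|d) = sum over topics z of P(w|z) * P(z|d)."""
    return sum(theta_d[z] * phi[z].get(word, 0.0) for z in range(len(theta_d)))


# Hypothetical distributions over words for two topics (each sums to one).
phi = [
    {"radar": 0.6, "track": 0.3, "ship": 0.1},
    {"radar": 0.1, "track": 0.2, "ship": 0.7},
]
# Hypothetical distribution over topics for one document d.
theta_d = [0.8, 0.2]

# For "radar" this evaluates 0.8 * 0.6 + 0.2 * 0.1.
p_radar = word_given_document("radar", theta_d, phi)
# Summing P(w|d) over the whole vocabulary must give one.
p_total = sum(word_given_document(w, theta_d, phi) for w in ("radar", "track", "ship"))
```

Note that `theta_d` has to be supplied for every document. pLSI estimates it for the
training documents but offers no rule for a new document, which is the limitation named in
the text.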


2.1.2 Latent Dirichlet Allocation 


Latent Dirichlet Allocation (LDA) is another variant of a finite mixture model and a
Bayesian extension of pLSI. LDA was first described by Blei et al. (2003).

The advantage of LDA over pLSI is that it models how the mixing proportions $\theta^{(d)}$
for each document $d$ are generated. The main idea is to treat these as random draws from a
Dirichlet distribution. That is, $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_T)$, in which $\alpha_j$ is the concentration
parameter for the $j$th topic. In Bayesian statistics, the Dirichlet distribution represents the
conjugate prior distribution for the multinomial distribution. The document generating process
therefore changes in the following way:

1. Choose a number $N$ to be the number of slots in the document. Each slot will be
   filled with exactly one word.
2. Choose $\theta \sim \mathrm{Dirichlet}(\alpha)$, the distribution over topics for document $d$.
3. For each slot,
   (a) choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$,
   (b) choose a word $w_n$ by sampling from $p(w_n|z_n)$, which is the $z_n$th column in
       the matrix $\phi$.
Figure 2.1 shows the model in plate notation (Blei et al., 2003). The parameters have the
following meanings:

• $\alpha$ is the parameter for the Dirichlet distribution used as a prior for the topic
  distributions.
• $\theta \sim \mathrm{Dirichlet}(\alpha)$ is the probability distribution over topics for a given document
  (also called a multinomial distribution).
• $M$ represents the number of documents in the corpus.
• $N \sim \mathrm{Poisson}(\xi)$ is a random number representing the number of words in a given
  document.
• $z \sim \mathrm{Multinomial}(\theta)$ represents the topic of a particular slot in the document.
• $\phi$ is the set of distributions over words, with one distribution for each topic.
• $w$ is the word chosen for a particular slot in a particular document, determined by
  $z$ and $\phi$.

Figure 2.1: The basic LDA model in plate notation (after: Blei et al., 2003)

Obviously, real documents are not generated in this manner. Modeling the document generating
process this way, however, allows us to use Bayesian inference methods to find groups of
words that form a topic.
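
The generating process can be simulated directly, which may help in reading the plate
diagram. The Python sketch below uses only the standard library; the helper names, the
vocabulary, the topic-word distributions `phi`, and the Poisson rate are all hypothetical.
Unlike the column convention in the text, `phi` is stored here with one row per topic.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_poisson(rate, rng):
    """Draw from a Poisson distribution (Knuth's method, adequate for small rates)."""
    limit = math.exp(-rate)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def generate_document(alpha, phi, vocabulary, xi, rng):
    """Generate one document following the three steps of the LDA process."""
    n_slots = sample_poisson(xi, rng)      # step 1: number of slots N
    theta = sample_dirichlet(alpha, rng)   # step 2: topic distribution of the document
    topics, words = [], []
    for _ in range(n_slots):               # step 3: fill each slot
        z = rng.choices(range(len(theta)), weights=theta)[0]  # (a) topic of the slot
        w = rng.choices(vocabulary, weights=phi[z])[0]        # (b) word given the topic
        topics.append(z)
        words.append(w)
    return theta, topics, words


vocabulary = ["radar", "track", "ship", "hull"]   # hypothetical vocabulary
phi = [[0.5, 0.4, 0.05, 0.05],                    # hypothetical topic 0 (sensors)
       [0.05, 0.05, 0.5, 0.4]]                    # hypothetical topic 1 (platforms)
theta, topics, words = generate_document([0.5, 0.5], phi, vocabulary, 8.0, random.Random(0))
```

Inference reverses this simulation: only `words` is observed, and the task is to recover
`theta` and `topics`.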


Formalization of the LDA model 


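The density of a $K$-dimensional Dirichlet random variable $\theta$ with $\sum_{k=1}^{K} \theta_k = 1$ and
$\theta_k > 0$ for all $k = 1, \ldots, K$ is defined as:

$$f(\theta|\alpha) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}, \tag{2.1}$$

in which $\alpha_k > 0$, $\forall k = 1, \ldots, K$, and $\Gamma(\cdot)$ is the standard gamma function (Ferguson, 1973).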
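
Equation 2.1 is usually evaluated on the log scale, because the gamma function overflows
quickly. The following Python sketch does this with `math.lgamma`; the function name and the
argument checks are illustrative choices rather than part of the thesis code.

```python
from math import exp, isclose, lgamma, log


def dirichlet_log_density(theta, alpha):
    """Logarithm of the Dirichlet density in Equation 2.1."""
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all concentration parameters must be positive")
    if any(t <= 0 for t in theta) or not isclose(sum(theta), 1.0):
        raise ValueError("theta must be strictly positive and sum to 1")
    # log of Gamma(sum alpha_k) / prod Gamma(alpha_k)
    log_normalizer = lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
    # log of prod theta_k^(alpha_k - 1)
    log_kernel = sum((a - 1.0) * log(t) for a, t in zip(alpha, theta))
    return log_normalizer + log_kernel


# With alpha = (1, 1, 1) the density is uniform on the simplex and equals Gamma(3) = 2.
uniform_density = exp(dirichlet_log_density([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
# With alpha = (2, 2) and theta = (0.5, 0.5): Gamma(4) / (Gamma(2) Gamma(2)) * 0.25 = 1.5.
peaked_density = exp(dirichlet_log_density([0.5, 0.5], [2.0, 2.0]))
```

The two sample values follow from Equation 2.1 by hand and serve as a quick sanity check of
the implementation.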
Assuming that the parameters $\alpha$ and $\phi$ from Figure 2.1 are known, the joint
distribution of a topic mixture $\theta$, a sequence of topic labels $z$, and a sequence of $N$ words $w$ is
given by

$$f(\theta, z, w|\alpha, \phi) = f(\theta|\alpha) \prod_{n=1}^{N} p(z_n|\theta)\, p(w_n|z_n, \phi) = f(\theta|\alpha) \prod_{v=1}^{V} \left(p(z_v|\theta)\, p(w_v|z_v, \phi)\right)^{n_{w_v}}, \tag{2.2}$$

in which $p(z_n = i|\theta) = \theta_i$ and $n_{w_v}$ is the number of times word $w_v$ appears in the document.


The parameter $\alpha$ is not written in bold letters, because we assume it to be constant, that is,
$\alpha_1 = \ldots = \alpha_T = \alpha$. Doing this leads to a symmetric Dirichlet distribution and an
exchangeable (see: Definition 2.0.1) stochastic process. The marginal distribution of a document is
then obtained by integrating over $\theta$ and summing over all $z$ (Blei et al., 2003):

$$p(w|\alpha, \phi) = \int f(\theta|\alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n|\theta)\, p(w_n|z_n, \phi)\, d\theta \tag{2.3}$$

$$= \int f(\theta|\alpha) \prod_{v=1}^{V} \left(\sum_{z} p(z|\theta)\, p(w_v|z, \phi)\right)^{n_{w_v}} d\theta. \tag{2.4}$$


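This allows us to express the marginal distribution of a corpus $D$ as the product over the
marginal distributions of all documents, since all $\theta_d$ are independent draws from the same
Dirichlet prior:

$$p(D|\alpha, \phi) = \prod_{d=1}^{M} \int f(\theta_d|\alpha) \prod_{n=1}^{N_d} \sum_{z_n} p(z_n|\theta_d)\, p(w_n|z_n, \phi)\, d\theta_d$$

$$= \prod_{d=1}^{M} \int f(\theta_d|\alpha) \prod_{v=1}^{V} \left(\sum_{z} p(z|\theta_d)\, p(w_v|z, \phi)\right)^{n_{d,w_v}} d\theta_d.$$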


Inference 


The goal is to compute the posterior distribution of the hidden variables $\theta$ and $z$ given
a document and the model parameters (Blei et al., 2003):

$$p(\theta, z|w, \alpha, \phi) = \frac{p(\theta, z, w|\alpha, \phi)}{p(w|\alpha, \phi)}.$$

Blei et al. (2003) show that the computation of this distribution is intractable. Therefore, they
use a method called variational inference to get around this issue. For our experiments,
we used a stochastic simulation method called Gibbs sampling, which is explained in
Section 2.1.3. Thus, the variational inference method will not be explained in any detail. For
further discussion and an implementation in the LDA context, see Blei et al. (2003).
Another approach to inference in the LDA model is Expectation Propagation (Minka
and Lafferty, 2002). This algorithm approximates integrals over functions that factor into
simple terms with the general form $\int p(\theta) \prod_{w} t_w(\theta)^{n_w}\, d\theta$. Equation 2.4 satisfies this
condition with $t_w(\theta) = \sum_{z} p(z|\theta)\, p(w|z, \phi)$. In order to apply Expectation Propagation, the
terms $t_w$ have to be approximated by terms with product form $\tilde{t}_w = s_w \prod_{z} \theta_z^{\beta_{w,z}}$. This
expression resembles a Dirichlet distribution with parameters $\beta_{w,z}$, as Minka and Lafferty (2002)
point out. An approximation to the posterior in Equation 2.4 is therefore given by

$$q(\theta) \propto f(\theta|\alpha) \prod_{w} \tilde{t}_w(\theta)^{n_w} = f(\theta|\gamma),$$

in which $\gamma_z = \alpha_z + \sum_{w} n_w \beta_{w,z}$ and $f(\cdot|\cdot)$ denotes the Dirichlet density. The Expectation
Propagation algorithm then performs an iterative optimization of the auxiliary parameters to
compute the best approximation to the true posterior distribution function. A sample
implementation of the algorithm using the R (R Development Core Team, 2008) environment is
provided in Appendix A.1.1.


2.1.3 LDA with Random Word Distribution


Griffiths and Steyvers (2004, 2006) introduce an extension of the basic LDA model,
which is also discussed by Blei et al. (2003). In this model, the distribution over words
specified by a topic is not known a priori but thought of as random with a Dirichlet density.
Thus, they pursue a fuller Bayesian approach, which is also fully generative. Figure 2.2
shows the model in plate notation. Instead of having a multinomial distribution $\phi^{(z)}$ for each
topic as a model parameter, this distribution is now a random variable following a symmetric
Dirichlet distribution with parameter $\beta$ (Griffiths and Steyvers, 2006).

Formalization of the Extended Model 


Under these assumptions and by setting $\beta_1 = \ldots = \beta_V = \beta$, in which $V$ is the size of
the vocabulary, and again $\alpha_1 = \ldots = \alpha_T = \alpha$, the updated joint density function is:

$$f(\theta, \phi, w, z|\alpha, \beta) = f(\theta|\alpha) \prod_{n=1}^{N} p(z_n|\theta)\, p(w_n|z_n, \phi^{(z_n)})\, f(\phi^{(z_n)}|\beta).$$

Figure 2.2: The fully generative model in plate notation (after: Griffiths and Steyvers, 2006)


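Integrating out $\theta$ and $\phi$ separately yields

$$p(w|z, \beta) = \left(\frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}}\right)^{T} \prod_{j=1}^{T} \int \prod_{v=1}^{V} \left(\phi_v^{(j)}\right)^{\beta - 1 + n_j^{(v)}} d\phi^{(j)}.$$

The expression inside the integral is proportional to a Dirichlet distribution with parameters
$\beta + n_j^{(v)}$. The normalizing constant is thus $\frac{\Gamma(n_j^{(\cdot)} + V\beta)}{\prod_{v=1}^{V} \Gamma(n_j^{(v)} + \beta)}$, and therefore the final expression
can be simplified to

$$p(w|z, \beta) = \left(\frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}}\right)^{T} \prod_{j=1}^{T} \frac{\prod_{v=1}^{V} \Gamma(n_j^{(v)} + \beta)}{\Gamma(n_j^{(\cdot)} + V\beta)},$$

in which $n_j^{(v)}$ is the number of times topic $j$ was assigned to word $v$ and $n_j^{(\cdot)}$ is the number of
times topic $j$ was assigned in total. The total probability of a topic sequence $z$ is then

$$p(z|\alpha) = \prod_{d=1}^{M} \int p(z^{(d)}|\theta^{(d)})\, f(\theta^{(d)}|\alpha)\, d\theta^{(d)} = \left(\frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\right)^{M} \prod_{d=1}^{M} \frac{\prod_{j=1}^{T} \Gamma(n_j^{(d)} + \alpha)}{\Gamma(n_{\cdot}^{(d)} + T\alpha)},$$

in which $n_j^{(d)}$ is the number of times topic $j$ was assigned to a word in document $d$ and $n_{\cdot}^{(d)}$ is
the number of words in document $d$ (Griffiths and Steyvers, 2004).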

Inference 
The posterior probability of topics given a corpus is computed as

$$p(z|w, \alpha, \beta) = \frac{p(w, z|\alpha, \beta)}{\sum_{z} p(w, z|\alpha, \beta)}, \tag{2.5}$$

which is unfortunately intractable (Blei et al., 2003). As Griffiths and Steyvers (2004) show
experimentally, the best way to estimate $p(z|w)$ is to use Gibbs sampling, a Markov chain
Monte Carlo (MCMC) method. These methods are discussed in Gamerman (1997) and Carlin
and Louis (1996) and formally introduced by Geman and Geman (1990).

Gibbs sampling does not directly compute the posterior probability distribution but
returns samples from the true posterior distribution after convergence. From these samples,
the actual distribution can then be estimated. After randomly assigning a topic to each word
in the corpus (resulting in a vector $z^{(0)}$), the algorithm works as follows, with $N$ denoting
the number of word tokens. For $t = 0, 1, 2, \ldots$, repeat

    draw $z_1^{(t+1)} \sim p(z_1 = j|z_2^{(t)}, z_3^{(t)}, \ldots, z_N^{(t)})$
    draw $z_2^{(t+1)} \sim p(z_2 = j|z_1^{(t+1)}, z_3^{(t)}, \ldots, z_N^{(t)})$
    ...
    draw $z_N^{(t+1)} \sim p(z_N = j|z_1^{(t+1)}, z_2^{(t+1)}, \ldots, z_{N-1}^{(t+1)})$

until convergence of the algorithm. The sampling distribution $p$ has the following form
(Griffiths and Steyvers, 2004):

$$p(z_i = j|z_{-i}, w, \alpha, \beta) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + V\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}, \tag{2.6}$$

in which $n_{-i,j}^{(\cdot)}$ is the number of times topic $j$ was assigned to a word, not including the current
assignment. Later, a derivation for this full conditional probability mass function is provided.
Appendix A.1.2 shows an implementation of a Gibbs sampling algorithm. It is very useful
for showing the concept of Gibbs sampling to a reader who knows how to read R source code.
However, for practical purposes the execution is too slow.
The model parameters $\theta$ and $\phi$ can be obtained by the transformations

\[
\hat{\phi}_j^{(w)} = \frac{n_j^{(w)} + \beta}{n_j^{(\cdot)} + V\beta}
\quad \text{and} \quad
\hat{\theta}_j^{(d)} = \frac{n_j^{(d)} + \alpha}{n_\cdot^{(d)} + T\alpha}.
\tag{2.7}
\]

This is Laplace smoothing (Zhai and Lafferty, 2004) on the samples. Since exchangeability
(see Definition 2.0.1) not only applies to the observations in a document, but also to the
obtained topic distributions, model averaging cannot be applied and each estimate $\hat{\phi}$ and
$\hat{\theta}$ is unique for the sample.
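
The sampling step of Equation 2.6 and the transformations of Equation 2.7 can be sketched in a few lines of Python. This is only an illustrative sketch and not the implementation from Appendix A.1.2; the function name `lda_gibbs`, its arguments, and the fixed number of sweeps (used here instead of a convergence check) are assumptions made for the example.

```python
import random


def lda_gibbs(docs, V, T, alpha, beta, iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA (hypothetical helper, not from the thesis).

    docs: list of documents, each a list of word ids in range(V).
    Returns (phi, theta) with phi[j][w] ~ p(w | topic j), theta[d][j] ~ p(topic j | doc d).
    """
    rng = random.Random(seed)
    n_wj = [[0] * T for _ in range(V)]   # word-topic counts
    n_dj = [[0] * T for _ in docs]       # document-topic counts
    n_j = [0] * T                        # number of tokens per topic
    z = []                               # topic assignment of every token

    # Random initial assignment z^(0).
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            j = rng.randrange(T)
            z_d.append(j)
            n_wj[w][j] += 1
            n_dj[d][j] += 1
            n_j[j] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from all counts.
                j = z[d][i]
                n_wj[w][j] -= 1
                n_dj[d][j] -= 1
                n_j[j] -= 1
                # Unnormalized Equation 2.6. The denominator of the second
                # factor does not depend on the topic and is dropped.
                weights = [
                    (n_wj[w][t] + beta) / (n_j[t] + V * beta) * (n_dj[d][t] + alpha)
                    for t in range(T)
                ]
                j = rng.choices(range(T), weights=weights)[0]
                z[d][i] = j
                n_wj[w][j] += 1
                n_dj[d][j] += 1
                n_j[j] += 1

    # Equation 2.7: smoothed point estimates from the final sample.
    phi = [[(n_wj[w][j] + beta) / (n_j[j] + V * beta) for w in range(V)]
           for j in range(T)]
    theta = [[(n_dj[d][j] + alpha) / (len(doc) + T * alpha) for j in range(T)]
             for d, doc in enumerate(docs)]
    return phi, theta
```

Each row of `phi` and `theta` sums to one by construction, which makes a quick sanity check on any run.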


2.2 Infinite Mixture Models 


Using heuristics or greedy algorithms to estimate the number of mixture components
is unsatisfying, because it causes additional computational effort. It makes sense to treat
the number of components as a function of the number of observations. This leads to a
non-parametric view, in which the number of parameters grows with the data. The question
becomes whether the LDA model can be extended in a non-parametric way. This requires a
prior distribution different from the Dirichlet distribution, which is fixed in its dimensionality.
Furthermore, this prior distribution should not come from a parametric family, but be a random
measure on the space of all probability distributions on the word space. Additionally, it
should be possible to apply inference methods on the posterior distribution.


2.2.1 Dirichlet Processes 


For the purposes of topic modeling, it is sufficient to obtain discrete distributions. A
model that provides the desired properties is the Dirichlet process. It was proposed by
Ferguson (1973). Measures drawn from a Dirichlet process are discrete with probability 1; the
process therefore defines a non-parametric prior distribution on the space of discrete
distributions. The Dirichlet process can be defined in several ways. For the purpose of document
modeling, three are of interest. First, the Dirichlet process arises as the model described by
a stick-breaking construction. Second, it can be defined as a distribution over partitions of a
measurable space by the Chinese Restaurant Process. And third, it turns out to be the limiting
distribution as the number of mixture components of a finite mixture model increases and
approaches infinity. All three views of the model are provided in Teh et al. (2006) and will
be restated now.

The stick-breaking construction is a metaphor that describes how a draw from a
Dirichlet process can be obtained. Let $G_0$ denote an arbitrary, not necessarily discrete,
probability measure and $\alpha_0 > 0$ a real number. Define independent sequences of i.i.d. random
variables $(\pi'_k)_{k=1}^{\infty}$ and $(\phi_k)_{k=1}^{\infty}$ such that
$\pi'_k \mid \alpha_0, G_0 \sim \mathrm{Beta}(1, \alpha_0)$ and $\phi_k \mid \alpha_0, G_0 \sim G_0$.
Further define $\pi_k = \pi'_k \prod_{l=1}^{k-1} (1 - \pi'_l)$. The sequence
$\pi = (\pi_k)_{k=1}^{\infty}$ adds up to 1 with probability 1. It thus defines a random
probability measure on the natural numbers. The random measure $G$ is then obtained as
$G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}$, in which $\delta_x$ is an atomic measure giving
mass 1 to the point $x$. It can be shown that $G \sim \mathrm{DP}(G_0, \alpha_0)$ (Teh et al., 2006).
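
A truncated version of the stick-breaking construction is easy to simulate. The following sketch is an illustration only: the function name `stick_breaking`, the truncation level, and the callable `base_sampler` (standing in for a draw from $G_0$) are assumptions, and the truncation makes the result an approximation of a draw from the Dirichlet process.

```python
import random


def stick_breaking(alpha0, base_sampler, truncation=1000, seed=0):
    """Approximate draw G = sum_k pi_k * delta_{phi_k} by truncated stick breaking.

    base_sampler(rng) is a hypothetical callable returning one draw from G_0.
    Returns (weights, atoms, remaining), where `remaining` is the stick mass
    that was not assigned because of the truncation.
    """
    rng = random.Random(seed)
    remaining = 1.0
    weights, atoms = [], []
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, alpha0)   # pi'_k ~ Beta(1, alpha0)
        weights.append(fraction * remaining)      # pi_k = pi'_k * prod_{l<k} (1 - pi'_l)
        atoms.append(base_sampler(rng))           # phi_k ~ G_0
        remaining *= 1.0 - fraction
    return weights, atoms, remaining
```

For example, `stick_breaking(2.0, lambda rng: rng.random())` uses a uniform base measure on the unit interval; smaller values of `alpha0` concentrate the mass on the first few atoms.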

The Chinese Restaurant Process (CRP), derived by Pitman (2006), represents another
view of the Dirichlet process. Imagine a restaurant with infinitely many tables. When the first
customer arrives, he will be assigned to the first table and choose the dish for that table from a
menu. In terms of the Dirichlet process, that means that a sample $\phi_1 \sim G_0$ is drawn from the
base distribution and assigned as the parameter for the first observation. The second customer
to arrive will sit at the first table with probability $\frac{1}{1+\alpha_0}$ or take a new table with
probability $\frac{\alpha_0}{1+\alpha_0}$. If he joins the first customer, he will have the same dish,
meaning he gets the same parameter assigned. If he gets a new table, he will generate a new dish
($\phi_2$). In general, the probability of sitting at an already populated table is
$p(\theta_n = \phi_i \mid \theta_1, \ldots, \theta_{n-1}) = \frac{n_i}{n-1+\alpha_0}$, in which $n_i$ is
the number of customers already sitting at table $i$. Apart from the concentration parameter in the
denominator, this is the ratio of the number of customers sitting at that table and the number of
customers in the restaurant. The concentration parameter $\alpha_0$ influences how often a new table
is chosen and therefore serves as an innovation parameter. The “tables” form a partition of the sample
space and the described process is equivalent to a Dirichlet process with base measure $G_0$
and concentration parameter $\alpha_0$. The discreteness of the Chinese Restaurant Process follows
from the countability of the tables.
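
The seating rule of the Chinese Restaurant Process can be simulated directly. The sketch below is illustrative; the function name and return values are assumptions chosen for the example.

```python
import random


def chinese_restaurant_process(n_customers, alpha0, seed=0):
    """Seat customers one by one according to the CRP.

    Returns (seating, table_sizes): the table index of each customer and the
    number of customers per table.
    """
    rng = random.Random(seed)
    table_sizes = []
    seating = []
    for _ in range(n_customers):
        # Existing table i has weight n_i; a new table has weight alpha0.
        # rng.choices normalizes by their sum, n - 1 + alpha0.
        weights = table_sizes + [alpha0]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[table] += 1    # join an occupied table
        seating.append(table)
    return seating, table_sizes
```

Running the function for a fixed number of customers and increasing values of `alpha0` shows the role of the concentration parameter as an innovation parameter: the number of occupied tables tends to grow with it.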

The last view of the Dirichlet process discussed here is that of the infinite limit of a
finite mixture model. Consider the LDA model described earlier. If one increases the number
of mixture components $K$ and defines the Dirichlet parameter as $\alpha = \frac{\alpha_0}{K}$ then,
as $K \to \infty$, LDA approaches a Dirichlet process. This is shown in Teh et al. (2006).


2.2.2 Hierarchical Dirichlet Processes 


For the task of document modeling, each document is thought to be generated by a
Dirichlet process. In this model, $G_0$ is the base measure on the word simplex. For convenience,
this should be a Dirichlet distribution, resulting in a unimodal distribution over the
space of multinomial distributions over the vocabulary. This model is very similar to the LDA
model. The Dirichlet process $G$ then represents the base measure.


Dirichlet Process Mixtures 

















Figure 2.3: The Dirichlet process mixture model in plate notation (after: Teh et al., 2006)


Dirichlet Process mixture models were introduced formally by Antoniak (1974).
Figure 2.3 shows a Dirichlet Process mixture model in plate notation. $G_0$ is the base measure,
e.g., a Dirichlet distribution with parameter $\lambda$, $\alpha_0$ the concentration parameter, and $G$ the
Dirichlet process prior. $\theta_i \mid G \sim G$ is then the multinomial distribution over words that belongs
to observation $x_i$. Because of the discreteness of the Dirichlet process, it is very likely that
many observations are generated by the same multinomial distribution, which is interpreted
as belonging to the same topic. However, the Dirichlet process mixture only allows the
generation of a single document. It now seems appealing to allow each document to be generated
by a draw from the Dirichlet Process itself, that is, instead of having a single Dirichlet Process
prior $G$, one has $G_1, \ldots, G_N$ for the $N$ documents in the corpus, which are independent
given $G_0$.


Hierarchical Approach 


Having a Dirichlet Process prior for each document under the Dirichlet Process mixture
model directly leads to a problem. Since $G_0$ is a continuous measure, $G_i$ and $G_j$ for $i \neq j$
have no atoms in common with probability one (Teh et al., 2006). This means that topics are
generated at the document level and not shared among the documents as desired.

One solution to this issue is to force $G_0$ to be a discrete measure, but that would be
too restrictive. On the other hand, it is a well known fact that a Dirichlet Process generates
discrete measures with probability one (Ferguson, 1973). Therefore, the procedure proposed by
Teh et al. (2006) is that $G_0$ is itself generated by a Dirichlet Process. $G_0$ is then a discrete
and nonparametric measure on the multinomial distributions over the word simplex and there
is positive probability for each topic (distributed according to $G_0$) to appear in any of the
documents. The resulting model is presented in Figure 2.4. Teh et al. (2006) refer to this
setting as the Chinese Restaurant Franchise. The metaphor is as follows: There is a number
of restaurants; each has an infinite number of tables. All restaurants serve dishes from a
global menu. When the first customer arrives, he will occupy the first table and the first dish
is generated. Once it is generated, it will be on the global menu and therefore be available
in all restaurants. The second customer in the same restaurant joins the first with probability
$\frac{1}{1+\alpha_0}$, having the same dish, or sits at a new table with probability
$\frac{\alpha_0}{1+\alpha_0}$. If he sits at a new table, he will have the same dish as customer 1
with probability $\frac{1}{1+\gamma}$ and a new dish will be created with probability
$\frac{\gamma}{1+\gamma}$. In general, if a new table is occupied, it will have an already
existing dish assigned with probability proportional to the number of tables serving the same
dish and a new dish with probability proportional to the concentration parameter $\gamma$.

For the document modeling task, dishes are associated with topics, which are drawn 
from the base measure H and shared among all documents. This nonparametric setting allows 
for a potentially infinite number of topics; the actual number can be learned from the data 
directly. Since the number of topics now is determined by a stochastic process, it makes 
sense to derive a probability distribution. 

Figure 2.4: The HDP model in plate notation (after: Teh et al., 2006)

This can be done in two steps. First, it is necessary to obtain a probability distribution
for the number of occupied tables (used topics) in a restaurant (document). Let $K(n)$ denote
a Bernoulli random variable that takes the value of 1 if the $n$th word in a document generates
a new topic in the document, or, correspondingly, if the $n$th customer occupies a new table
in the restaurant. That is, $P(K(n) = 1) = \frac{\alpha_0}{n-1+\alpha_0}$. Let $X(N)$ denote the number
of topics used in a document with $N$ words. Then $X(N) = \sum_{n=1}^{N} K(n)$. Now derive an
expression for $P(X(N) = k)$. For the case $N = 1$ this is trivial. For $N = 2$, the probabilities are
$P(X(2) = 1) = \frac{1}{1+\alpha_0}$ and $P(X(2) = 2) = \frac{\alpha_0}{1+\alpha_0}$. For $N = 3$, the
expressions become a little bit more involved, for example
$P(X(3) = 1) = \frac{1}{1+\alpha_0} \cdot \frac{2}{2+\alpha_0} = \frac{2}{(2+\alpha_0)(1+\alpha_0)}$.
For arbitrary $N$ and $k$,

\[
P(X(N) = k) = s(N, k)\, \alpha_0^{k}\, \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + N)},
\]

in which $s(\cdot,\cdot)$ denotes the unsigned Stirling numbers of the first kind (Teh et al., 2006). This
expression, however, cannot be computed for large $N$, because both the Stirling numbers
and the evaluated Gamma function exceed machine precision. Therefore, a recursive formula
is preferred:

\[
P(X(N) = k) = \frac{1}{N-1+\alpha_0}
\Bigl( \alpha_0\, P(X(N-1) = k-1) + (N-1)\, P(X(N-1) = k) \Bigr).
\tag{2.8}
\]

In Equation 2.8, the relation to the Stirling numbers of the first kind can be seen easily using
the recurrence equation $s(N, k) = s(N-1, k-1) + (N-1)\, s(N-1, k)$ with $s(1, 1) = 1$
for $N, k > 0$ and $N \geq k$, otherwise $s(N, k) = 0$. This recurrence relation generates a
triangular matrix that looks very similar to the Pascal triangle. Appendix B.1 presents an
implementation of this algorithm for direct use in Matlab or Octave, which computes the
probability mass function efficiently.
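
The recursion in Equation 2.8 translates into a short dynamic program. The sketch below is written in Python for illustration and is not the MEX function from Appendix B.1; the function name `num_tables_pmf` is an assumption.

```python
def num_tables_pmf(N, alpha0):
    """Return [P(X(N) = 0), ..., P(X(N) = N)] using the recursion of Equation 2.8."""
    pmf = [0.0, 1.0]                      # N = 1: exactly one table is occupied
    for n in range(2, N + 1):
        new = [0.0] * (n + 1)
        for k in range(1, n + 1):
            # P(X(n-1) = k) is zero for k = n, since at most n - 1 tables exist.
            stay = pmf[k] if k < n else 0.0
            new[k] = (alpha0 * pmf[k - 1] + (n - 1) * stay) / (n - 1 + alpha0)
        pmf = new
    return pmf
```

Because every step only combines probabilities, the intermediate values stay in the unit interval and the overflow of the closed-form expression is avoided. For $N = 3$ and $\alpha_0 = 1$ the function returns the probabilities 1/3, 1/2, and 1/6 for one, two, and three occupied tables.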

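Since $K(n)$ is a Bernoulli random variable, $E[K(n)] = \frac{\alpha_0}{n-1+\alpha_0}$ and
$V[K(n)] = \frac{\alpha_0 (n-1)}{(n-1+\alpha_0)^2}$ (Billingsley, 1986). Because $K(n)$, $K(m)$,
$m \neq n$ are independent, the expected value and variance for $X(N)$ can then be computed easily:

\[
E[X(N)] = \sum_{n=1}^{N} E[K(n)] = \sum_{n=1}^{N} \frac{\alpha_0}{n-1+\alpha_0}
\tag{2.9}
\]

and

\[
V[X(N)] = \sum_{n=1}^{N} V[K(n)] = \sum_{n=1}^{N} \frac{\alpha_0 (n-1)}{(n-1+\alpha_0)^2}.
\tag{2.10}
\]
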


For large numbers of words per document and large concentration parameters, a normal
approximation can be found via moment matching. This gives an approximate distribution
for the total number of tables in the Chinese Restaurant Franchise, that is, the sum of
occupied tables over all restaurants, as the sum of $J$ normal random variables, one per
restaurant. A better fit that also works for smaller numbers of words and/or concentration
parameters is provided by the Gamma distribution. The parameters are estimated by moment
matching as well, using
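$\lambda = \frac{\mu}{\sigma^2}$ and $a = \frac{\mu^2}{\sigma^2}$, in which $\mu$ and $\sigma^2$ are the mean and variance from Equations 2.9 and 2.10. Figure 2.5 shows an example of the true distribution with overlaid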
approximating continuous densities. It was obtained using a document size of 20 words and
$\alpha_0 = 1$, which is the setting for a long query or an abstract. One can see from the figure that
the gamma distribution provides a better fit than the normal distribution, because it captures
the skewness of the true distribution. The total number of draws from the base measure (total
number of occupied tables) in case of a Gamma approximation is the sum of $J$ independent
Gamma random variables. This sum does not follow a Gamma distribution, but computation
of the probability density function is still tractable (Moschopoulos, 1985). If the total number
of occupied tables in the franchise is known, the distribution of the number of dishes (topics)
is easily computed by Equation 2.8. If the total number of used topics is unknown, the
probability distribution cannot be obtained in an easy way. The normal or gamma approximations,
however, allow the expected value of the total number of topics in a corpus to be expressed
as a function of a sum of random variables, which is tractable.

Figure 2.5: Approximation of the true probability $P(T = t)$ by continuous distributions
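
The moments in Equations 2.9 and 2.10 and the moment-matched Gamma parameters can be computed as in the following sketch. The function names and the shape/rate parametrization of the Gamma distribution are assumptions made for this example.

```python
def table_count_moments(N, alpha0):
    """Mean and variance of the number of occupied tables (Equations 2.9 and 2.10)."""
    mean = sum(alpha0 / (n - 1 + alpha0) for n in range(1, N + 1))
    var = sum(alpha0 * (n - 1) / (n - 1 + alpha0) ** 2 for n in range(1, N + 1))
    return mean, var


def gamma_moment_match(mean, var):
    """Shape and rate of the Gamma distribution with the given mean and variance.

    Assumes the shape/rate parametrization: mean = shape / rate and
    variance = shape / rate**2. Requires var > 0.
    """
    return mean ** 2 / var, mean / var
```

For the setting of Figure 2.5 the call would be `gamma_moment_match(*table_count_moments(20, 1.0))`. The normal approximation uses the same mean and variance directly.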


Inference and Learning on Hierarchical Dirichlet Processes 


Teh et al. (2006) suggest that a Gibbs sampling scheme (see Section 2.1.3) with di- 
rect assignment performs best for inferring the posterior distribution in an HDP setup. This 
suggestion is also supported by experimental results provided by Teh et al. (2008). 

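For the direct assignment sampling scheme, tables in a restaurant are only represented
as $m_{jk}$, the number of tables in restaurant $j$ serving dish $k$. In the document modeling context
that means the number of groups of words in document $j$ sharing topic $k$. Further, let $n_{kv}$
denote the number of times the word with vocabulary index $v$ was assigned to topic $k$. Let
$h(\cdot \mid \lambda)$ denote the prior density of the mixture components, in this case the density of a
symmetric Dirichlet distribution with parameter $\lambda$. The conditional density of a word $x_{ji}$ in topic
$k$, given all other assignments of topics to all words in the corpus and excluding the current
assignment, can be derived as

\[
\begin{aligned}
f_k^{-x_{ji}}(x_{ji})
&= \frac{\int f(x_{ji} \mid \phi^{(k)}) \prod_{j'i' \neq ji,\; z_{j'i'} = k} f(x_{j'i'} \mid \phi^{(k)})\, h(\phi^{(k)} \mid \lambda)\, d\phi^{(k)}}
        {\int \prod_{j'i' \neq ji,\; z_{j'i'} = k} f(x_{j'i'} \mid \phi^{(k)})\, h(\phi^{(k)} \mid \lambda)\, d\phi^{(k)}} \\
&= \frac{\int \frac{\Gamma(V\lambda)}{\Gamma(\lambda)^{V}} \prod_{v=1}^{V} \bigl(\phi_v^{(k)}\bigr)^{n_{kv} + \delta(x_{ji}, v) + \lambda - 1}\, d\phi^{(k)}}
        {\int \frac{\Gamma(V\lambda)}{\Gamma(\lambda)^{V}} \prod_{v=1}^{V} \bigl(\phi_v^{(k)}\bigr)^{n_{kv} + \lambda - 1}\, d\phi^{(k)}} \\
&= \frac{\prod_{v=1}^{V} \Gamma\bigl(n_{kv} + \delta(x_{ji}, v) + \lambda\bigr)}
        {\Gamma\bigl(\sum_{v=1}^{V} (n_{kv} + \delta(x_{ji}, v) + \lambda)\bigr)}
   \cdot
   \frac{\Gamma\bigl(\sum_{v=1}^{V} (n_{kv} + \lambda)\bigr)}
        {\prod_{v=1}^{V} \Gamma\bigl(n_{kv} + \lambda\bigr)} \\
&= \frac{n_{k x_{ji}} + \lambda}{n_{k\cdot} + V\lambda},
\end{aligned}
\]

in which the counts $n_{kv}$ and $n_{k\cdot} = \sum_{v} n_{kv}$ exclude the current assignment and
$\delta(x_{ji}, v)$ equals 1 if $x_{ji}$ has vocabulary index $v$ and 0 otherwise. The result
also resembles the first factor of the Gibbs update in Equation 2.6. For the case of
$k = k^{\mathrm{new}}$, the density just becomes
$f_{k^{\mathrm{new}}}^{-x_{ji}}(x_{ji}) = \frac{\lambda}{V\lambda} = \frac{1}{V}$. Let now $\beta_k$ denote
the overall popularity of topic $k$ compared to all others. That is, $0 < \beta_k < 1$ and
$\sum_{k=1}^{K} \beta_k + \beta_u = 1$, in which $u$ is the index of the next unseen topic. It can be
seen directly that $\beta_u$ will become smaller as the number of observations grows. In order to
sample $\beta$, all $m_{jk}$ have to be known. In a Gibbs sampling environment, this is done by
sampling. Antoniak (1974) derived a full conditional probability mass function for $m_{jk}$,
ignoring the current assignment $jk$: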


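\[
p(m_{jk} = m \mid \beta) = \frac{\Gamma(\alpha_0 \beta_k)}{\Gamma(\alpha_0 \beta_k + n_{j \cdot k})}\,
s(n_{j \cdot k}, m)\, (\alpha_0 \beta_k)^{m}.
\]

Here, $s(\cdot,\cdot)$ are the unsigned Stirling numbers of the first kind and $n_{j \cdot k}$ is the number of
words in document $j$ sharing topic $k$ or, following the metaphor, the number of customers in
restaurant $j$ having dish $k$. Again, a computationally more appealing recursive representation is
provided:

\[
\begin{aligned}
p(m_{jk} = m \mid \beta, n_{j \cdot k}) = \frac{1}{n_{j \cdot k} - 1 + \alpha_0 \beta_k}
\Bigl( & \alpha_0 \beta_k\, p(m_{jk} = m - 1 \mid \beta, n_{j \cdot k} - 1) \\
& + (n_{j \cdot k} - 1)\, p(m_{jk} = m \mid \beta, n_{j \cdot k} - 1) \Bigr).
\end{aligned}
\]
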


Teh et al. (2006) show that $\beta \sim \mathrm{Dirichlet}(m_{\cdot 1}, \ldots, m_{\cdot K}, \gamma)$ directly
depends on the number of groups across all documents that have topic $k$ assigned. Referring to the
restaurant franchise metaphor, this is the number of tables across all restaurants that serve dish $k$.
Having all the pieces together, the actual sampling for topics can be conducted. Each word $x_{ji}$ is
now associated directly with a topic using an indicator variable $z_{ji}$. This indicator variable is then
estimated using a Gibbs sampling scheme with the following update:

\[
p(z_{ji} = k \mid z^{-ji}, m, \beta) \propto
\begin{cases}
(n_{j \cdot k}^{-ji} + \alpha_0 \beta_k)\, f_k^{-x_{ji}}(x_{ji}) & \text{if } k \text{ previously used,} \\
\alpha_0 \beta_u\, f_{k^{\mathrm{new}}}^{-x_{ji}}(x_{ji}) & \text{if } k = k^{\mathrm{new}}.
\end{cases}
\tag{2.11}
\]

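
One step of the update in Equation 2.11 can be sketched as follows. The sketch assumes a symmetric Dirichlet prior with parameter `lam` on the topics, so that the conditional density of a word in an existing topic is a smoothed relative frequency and the density under a new topic is `1 / V`. All function and argument names are hypothetical and do not mirror the R code in Appendix A.2.1.

```python
import random


def hdp_topic_weights(word, n_jk, n_kv, n_k, beta, beta_u, alpha0, lam, V):
    """Unnormalized weights of Equation 2.11 for one word.

    word:   vocabulary index of the current word
    n_jk:   n_jk[k] = words in the current document assigned to topic k
    n_kv:   n_kv[k][v] = times word v is assigned to topic k
    n_k:    n_k[k] = total number of words assigned to topic k
    All counts exclude the current word. The last weight belongs to a new topic.
    """
    weights = []
    for k in range(len(beta)):
        f_k = (n_kv[k][word] + lam) / (n_k[k] + V * lam)
        weights.append((n_jk[k] + alpha0 * beta[k]) * f_k)
    weights.append(alpha0 * beta_u / V)   # new topic: density 1 / V
    return weights


def sample_topic(weights, rng):
    """Draw a topic index; the value len(weights) - 1 signals a new topic."""
    return rng.choices(range(len(weights)), weights=weights)[0]
```

A complete sampler would additionally update the counts, resample the table counts $m_{jk}$ and the weights $\beta$, and extend the data structures whenever a new topic is drawn.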


This concludes the inference task. An implementation in R is given in Appendix A.2.1. For
the learning task, a sampling scheme called auxiliary variable sampling is employed. Teh
et al. (2006) introduce binary variables $s_j$ and continuous variables $w_j \in [0, 1]$ for each
Dirichlet process (each document). Assuming that the concentration parameter $\alpha_0$ has a
gamma distribution with parameters $a$ and $b$ as prior leads to the following expression for
the posterior distribution:

\[
q(\alpha_0 \mid w, s) \propto \alpha_0^{\,a - 1 + m_{\cdot\cdot} - \sum_{j=1}^{J} s_j}\,
e^{-\alpha_0 \left( b - \sum_{j=1}^{J} \log w_j \right)}.
\tag{2.12}
\]

This resembles a gamma distribution with parameters $a + m_{\cdot\cdot} - \sum_{j=1}^{J} s_j$ and
$b - \sum_{j=1}^{J} \log w_j$. Since $w_j$ and $s_j$ are conditionally independent, given $\alpha_0$, they
can be sampled independently with distributions

\[
q(w_j \mid \alpha_0) \propto w_j^{\alpha_0} (1 - w_j)^{n_{j\cdot\cdot} - 1},
\tag{2.13}
\]

which resembles a beta distribution with parameters $\alpha_0 + 1$ and $n_{j\cdot\cdot}$, and

\[
q(s_j \mid \alpha_0) \propto \left( \frac{n_{j\cdot\cdot}}{\alpha_0} \right)^{s_j},
\tag{2.14}
\]

which is proportional to the probability mass function of a Bernoulli random variable. The
hyperparameter $\gamma$ can be obtained in the same way, using $K$ instead of $m_{\cdot\cdot}$ and
$m_{\cdot\cdot}$ instead of $n_{j\cdot\cdot}$. Given the last steps, it is an easy task to implement a system
that models the documents in a corpus and learns the parameters $\alpha_0$ and $\gamma$. That the gamma
distribution is a natural choice as a prior for $\alpha_0$ and $\gamma$ has been shown by Blei and Jordan
(2006). Appendix A.2.2 shows how the auxiliary sampling scheme can be implemented.
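
One sweep of the auxiliary variable scheme for $\alpha_0$ (Equations 2.12 to 2.14) can be sketched as follows. This is an illustration under stated assumptions and not the code of Appendix A.2.2: the function name and arguments are hypothetical, `b` is treated as a rate parameter, and the Bernoulli probability is obtained by normalizing Equation 2.14.

```python
import math
import random


def resample_alpha0(alpha0, doc_lengths, m_total, a, b, rng):
    """One auxiliary-variable update of the concentration parameter alpha0.

    doc_lengths: number of words n_j per document
    m_total:     total number of tables over all documents
    a, b:        shape and rate of the gamma prior on alpha0
    """
    sum_s = 0
    sum_log_w = 0.0
    for n_j in doc_lengths:
        w_j = rng.betavariate(alpha0 + 1.0, n_j)                  # Equation 2.13
        s_j = 1 if rng.random() < n_j / (n_j + alpha0) else 0     # Equation 2.14, normalized
        sum_s += s_j
        sum_log_w += math.log(w_j)
    shape = a + m_total - sum_s                                   # Equation 2.12
    rate = b - sum_log_w
    # random.gammavariate expects a scale parameter, hence 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)
```

Since every document occupies at least one table, `m_total` is never smaller than the number of documents, which keeps the shape parameter positive.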
CHAPTER 3:
Application to Search 





This chapter shows how a document model can be used in information retrieval tasks. 
It starts out by producing robust point estimates from probabilistic topic models. Then an 
alternative, purely keyword-based, model is introduced and “mixed” with the topic model. 
This mixture is driven by a mixing weight that determines how much influence the topic
model has in the combined model.

The keyword-based model is used because related research (Wei and Croft, 2006) sug- 
gests that a probabilistic topic model itself is too coarse to obtain good information retrieval 
performance. A topic model, however, adds another aspect to keyword search by not just 
assessing whether a certain keyword is contained in the document but also evaluating the cor- 
relation between words. Thus, a relevant document can be returned by the search application, 
even if it does not share any terms with the query. 

The chapter finishes by introducing how document relevance is determined and
documents are ranked.


3.1 Combining Estimates from Different Markov Chains 


If a Gibbs sampler (see Chapter 2) is used for inference, each individual run represents
a Markov chain that, after convergence, produces a point estimate for the type-topic
distribution ($\phi$) and the topic-document distribution ($\theta$). To get a more robust estimate for
the distributions, several runs of a Gibbs sampler with different random seeds should be
combined.

It is tempting to define $\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_i$, in which $\hat{\theta}_i$
is the point estimate from the $i$th of $n$ runs of the Gibbs sampler. The point estimate $\hat{\phi}$
would be obtained similarly. This is, however, not feasible because of the exchangeability of topics
(see Definition 2.0.1). That is, the order of topics is not predetermined and there is no way of
controlling that order during the Gibbs sampling runs. The attempt to reorder the topics after the runs
poses a matching problem, which is not trivially solved.

To use the computed distributions for search, it is sufficient to have a point estimate
for the probability of a word being contained in a document: $P_{\mathrm{topic}}(w \mid d)$. In LDA- and
HDP-based models, this probability is computed as

\[
P_{\mathrm{topic}}(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d).
\tag{3.1}
\]

Because addition is commutative, the order of the topics is not relevant at all.
For the point estimates of each particular Gibbs sampler run, Equation 3.1 is a simple matrix
multiplication that yields the matrix of multinomial distributions for the documents.
The point estimate for the word-document distributions ($\hat{\zeta}$) is therefore the average
of the matrix products from each of the $n$ runs of the Gibbs sampler:

\[
\hat{\zeta} = \frac{1}{n} \sum_{i=1}^{n} \hat{\phi}_i \hat{\theta}_i.
\tag{3.2}
\]

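
The following sketch illustrates why averaging the word-document probabilities works although the topics of different runs cannot be matched. It is written with plain Python lists for readability; the function names and the list layout are assumptions for this example.

```python
def word_doc_matrix(phi, theta):
    """Equation 3.1 for one run: phi[z][w] = p(w | z), theta[d][z] = p(z | d).

    Returns p[w][d], the probability of word w in document d.
    """
    n_topics, V, D = len(phi), len(phi[0]), len(theta)
    return [[sum(phi[z][w] * theta[d][z] for z in range(n_topics))
             for d in range(D)]
            for w in range(V)]


def average_chains(runs):
    """Equation 3.2: average the per-run products over all runs.

    runs: list of (phi, theta) pairs, one pair per Gibbs sampler run.
    """
    mats = [word_doc_matrix(phi, theta) for phi, theta in runs]
    V, D = len(mats[0]), len(mats[0][0])
    return [[sum(m[w][d] for m in mats) / len(mats) for d in range(D)]
            for w in range(V)]
```

Permuting the topics of a run, that is, reordering the rows of `phi` and the corresponding entries of every row of `theta`, leaves `word_doc_matrix` unchanged, so no matching of topics between runs is needed.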


3.2 A Keyword-based Language Model 


In addition to the index generated by the probabilistic topic model, either LDA- or
HDP-based, it is necessary to compute a point estimate for the probability of a word being
contained in a document based on the documents directly. The method that is used for this is
called Bayesian Smoothing using Dirichlet Priors as described by Zhai and Lafferty (2004).
This model is also used in Wei and Croft (2006) and is computed as follows:

\[
P_{\mathrm{Dirichlet}}(w \mid d) =
\frac{c(w, d) + \mu\, \frac{\sum_{d'} c(w, d')}{\sum_{d'} |d'|}}{|d| + \mu}.
\tag{3.3}
\]

Here, $\mu$ is the smoothing parameter, $c(w, d)$ is a function that counts how often a token of
type $w$ appears in document $d$, and $|d|$ denotes the number of words in the document. The
method is called Dirichlet smoothing, because $P_{\mathrm{Dirichlet}}(w \mid d)$ is the maximum a posteriori
(MAP) estimate of a Dirichlet-Multinomial model with prior parameter
$\mu\, \frac{\sum_{d'} c(w, d')}{\sum_{d'} |d'|}$ and the document $d$ as evidence.
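
A minimal sketch of Dirichlet smoothing, assuming that documents are given as lists of tokens, could look as follows. The function name `dirichlet_smoothed` and the closure-based interface are choices made for this example only.

```python
from collections import Counter


def dirichlet_smoothed(docs, mu):
    """Return a function prob(w, d) implementing Equation 3.3.

    docs: list of documents, each a list of tokens
    mu:   smoothing parameter
    """
    doc_counts = [Counter(doc) for doc in docs]
    collection = Counter()
    for counts in doc_counts:
        collection.update(counts)
    total = sum(collection.values())          # number of tokens in the corpus

    def prob(w, d):
        background = collection[w] / total    # relative frequency in the corpus
        return (doc_counts[d][w] + mu * background) / (len(docs[d]) + mu)

    return prob
```

A word that does not occur in a document still receives a positive probability as long as it occurs somewhere in the corpus, which is the property the predictive likelihood ranking relies on.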


3.3 Ranking Documents


A search application evaluates the relevance of every document in the corpus and
then returns an ordered list of documents or pointers to documents. This process is called
ranking. This section introduces different methods that can be used to rank documents in a
probabilistic topic model. All of these methods were implemented in the prototype.
Predictive likelihood, however, performs best by far.


3.3.1 Ranking Documents Based on Predictive Likelihood 


The ranking of the documents in response to a user query is determined by the predictive
likelihood, which roughly can be seen as the probability that the query $Q$ was generated by
the model of the document $d$. More formally:

\[
p(Q \mid d) = \prod_{q \in Q} p(q \mid d)
= \prod_{q \in Q} \bigl( \lambda P_{\mathrm{Dirichlet}}(q \mid d) + (1 - \lambda) P_{\mathrm{topic}}(q \mid d) \bigr),
\tag{3.4}
\]

in which $\lambda$ determines the weighting between the keyword-based model and the topic-based
model. A document that has a high predictive likelihood will get assigned a high rank. At
this point it becomes apparent that all probabilities need to be strictly positive; otherwise the
product will be zero and even a very relevant document can end up with the lowest possible
ranking.

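
The ranking by predictive likelihood can be sketched as follows. The sketch sums logarithms instead of multiplying probabilities, which is a choice made here to avoid numerical underflow for long queries and does not change the order of the documents. The function name and the callables `p_dirichlet` and `p_topic` are assumptions for this example.

```python
import math


def rank_by_predictive_likelihood(query, p_dirichlet, p_topic, num_docs, lam):
    """Rank documents by Equation 3.4, best document first.

    query:        list of query terms
    p_dirichlet:  function (term, doc index) -> keyword-based probability
    p_topic:      function (term, doc index) -> topic-based probability
    lam:          mixing weight between the two models
    """
    scored = []
    for d in range(num_docs):
        log_likelihood = sum(
            math.log(lam * p_dirichlet(q, d) + (1.0 - lam) * p_topic(q, d))
            for q in query
        )
        scored.append((log_likelihood, d))
    scored.sort(reverse=True)
    return [d for _, d in scored]
```

The logarithm also makes the requirement of strictly positive probabilities explicit: a zero probability raises an error instead of silently pushing a document to the end of the list.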
3.3.2 Ranking Based on Topic Distributions 


For a query that is supplied by a user, the most likely topic distribution under the 
probabilistic topic model can easily be determined. In case of an LDA model, Variational 
methods (Blei et al., 2003), Expectation Propagation (Minka and Lafferty, 2002), or Gibbs 
sampling are suitable. For the HDP-based model, Variational methods (Teh et al., 2008) or 
Gibbs sampling (Teh et al., 2006) can be used. In either method, the result will be a multi- 
nomial distribution over the topics, which looks very similar to the multinomial distribution 
over topics that is inferred for each document in the corpus. 

The ranking of the documents now can be determined based on the similarity of the 
topic distribution of the query and the topic distribution of a document. In the experiments, 
however, all similarity measures on the topic distribution (angle and divergence measures) 
were outperformed by predictive likelihood. 


Cosine Similarity 


A multinomial distribution over topics can be interpreted as a vector in Euclidean
space, in which the number of different topics determines the number of dimensions. The
angle between the query vector and the document vector can be used as a similarity measure.
An angle of zero would mean the query and the document have exactly the same topics in the
same proportions, thus giving the document the highest possible rank, whereas an angle of
90 degrees means that the query and the document do not share any topics.
Instead of computing the angle, it makes sense to consider the cosine of the angle,
which maps orthogonal vectors to zero and vectors with the same direction to one. The
cosine similarity of a query $q$ and a document $d$ is then defined as:

\[
\mathrm{sim}(q, d) = \frac{\langle \theta_q, \theta_d \rangle}{\|\theta_q\| \, \|\theta_d\|},
\]

in which $\langle \cdot, \cdot \rangle$ denotes the inner product and $\|\cdot\|$ the Euclidean norm.
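
As a small illustration, the cosine similarity of two topic distributions can be computed as follows; the function name is an assumption for this example.

```python
import math


def cosine_similarity(theta_q, theta_d):
    """Cosine of the angle between two topic distributions given as lists."""
    dot = sum(a * b for a, b in zip(theta_q, theta_d))
    norm_q = math.sqrt(sum(a * a for a in theta_q))
    norm_d = math.sqrt(sum(b * b for b in theta_d))
    return dot / (norm_q * norm_d)
```

Distributions that share no topics, such as `[1.0, 0.0]` and `[0.0, 1.0]`, give a similarity of zero, whereas identical distributions give a similarity of one up to rounding.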


Kullback-Leibler Divergence 


Since multinomial distributions over topics are probability measures, their distance for
a query $Q$ and a document $D$ can be determined using the Kullback-Leibler (KL) divergence.
It is defined as:

\[
D_{KL}\bigl(p(z \mid Q) \,\|\, p(z \mid D)\bigr) = \sum_{z=1}^{K} p(z \mid Q) \log \frac{p(z \mid Q)}{p(z \mid D)}.
\]

From the definition, it is obvious that neither $p(z \mid Q)$ nor $p(z \mid D)$ can be zero for any topic $z$.
Further, it is clear that $D_{KL}\bigl(p(z \mid Q) \,\|\, p(z \mid D)\bigr) \geq 0$ with equality if
$p(z \mid Q) = p(z \mid D)$. This is different from the cosine similarity, in which the value is larger
if the vectors are more similar. In KL divergence, large numbers express a high distance between
the measures.
The definition of KL divergence shows that it is not a symmetric measure. To get
around this issue, a symmetrized divergence, referred to here as the Jensen-Shannon (JS)
divergence, is defined as the arithmetic mean of the two possible KL divergences:

\[
D_{JS}\bigl(p(z \mid Q) \,\|\, p(z \mid D)\bigr) = \frac{1}{2}
\Bigl( D_{KL}\bigl(p(z \mid Q) \,\|\, p(z \mid D)\bigr) + D_{KL}\bigl(p(z \mid D) \,\|\, p(z \mid Q)\bigr) \Bigr).
\]

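
Both divergence measures take only a few lines of code. The sketch assumes that all probabilities are strictly positive, as required above; the function names are assumptions for this example.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence between two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetrized_divergence(p, q):
    """Arithmetic mean of the two possible KL divergences."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

For ranking, documents would be sorted by increasing divergence from the query, in contrast to the cosine similarity, for which larger values indicate a better match.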


3.4 Steps Towards Implementation 


Computing the keyword-based model and averaging topic models can easily be imple- 
mented as matrix operations. Most general purpose languages provide efficient data structures 
and algorithms for basic matrix algebra. Predictive likelihood is very efficient if the query is 
represented by a sparse vector structure. The ranking then reduces to a few table lookups and 
multiplications per document followed by a sorting algorithm on the relevance scores, which 
is O(n log n). Similarly, cosine similarity and divergence measures have efficient implemen- 
tations, although they did not perform well in the preliminary experiments. 

Altogether this allows for practical implementations for online search and ad hoc
retrieval.

CHAPTER 4:
Implementation 





This chapter will illuminate some of the implementation steps that were necessary 
for building a search engine based on probabilistic topic models. For the implementation, 
MALLET (McCallum, 2002) was used. MALLET is an open source toolkit, written in Java. 
It provides most of the functionality that is needed for document modeling, clustering, and 
classification tasks. 

Matrix computations are implemented using COLT, a matrix API for Java developed at
CERN (Hoschek, 2004). It performs standard matrix operations like addition and
multiplication for large matrices with double precision numbers very efficiently. In addition,
it provides a very fast and memory-efficient implementation of sparse matrices, which are
used frequently in natural language processing tasks.

The prototype implementation is built into the framework of Hawkins’ (2009) search 
application and can be used as a search module in this modular framework. Java is used as 
the implementation language, allowing the use of well-developed APIs for text processing, 
matrix algebra, and stochastic processes. 

Examples in this chapter will mostly refer to the CRANFIELD benchmark corpus. 
For a discussion of the improvements of the implemented prototype on other corpora see 
Chapter 5. 

The source code documentation provides a more detailed description of methods and
fields that are used by the application.


4.1 Preprocessing 


It is important that queries and documents are preprocessed in the same way to ensure 
that string comparison leads to correct results. That is, the vocabulary for a document in the 
corpus has to be exactly the same as for a query. 

In MALLET there is a special class, the Pipe class, which fulfills the preprocessing
task. Pipe is an abstract class, which is extended by several classes, each of them carrying out
a particular preprocessing step. Every document is then “piped” through a list of these Pipe
objects and the end result is a preprocessed document that can be used for index building.

The idea of using a pipe makes MALLET very flexible and attractive for text processing
tasks. It allows us to arrange processing modules in a list and send the raw text
data through all necessary steps: creating a character sequence, converting it to a token
sequence, removing stopwords, and so on.


4.1.1 Tokenizing 


The first step in preprocessing raw documents is to tokenize them. This step splits a
character sequence into a token sequence, typically on whitespace and special characters
like commas or periods. With MALLET, after reading the character stream, the sequence
is then processed through a pipe, in which a regular expression pattern is applied to the
character stream to remove special characters and digits.


4.1.2 Stemming 


Stemming maps inflections of a word to a common stem. That is, “running,” “ran,”
and “run” are all mapped to “run.” One of the most common stemming algorithms is the
Porter Stemmer (Porter, 1980). A Porter stemmer applies a small number of rules of the form
“(condition) S1 → S2” on every token that meets the condition and has form S1 and changes
it to S2. Stems found by a Porter stemming algorithm do not necessarily agree with the
linguistic stem of a word. For example, “happy” will be stemmed to “happi,” which would
not be called a stem by a linguist. However, stemming reduces the number of distinct words
in a vocabulary, thus reducing noise. Unfortunately, this comes at the cost of introducing
additional ambiguities into the documents.

For a probabilistic topic model like the LDA- or the HDP-based model, these additional
ambiguities result in stronger correlation between similar documents, thus improving the
recall in the information retrieval task. For the example of the CRANFIELD data set, a Porter
stemmer leaves 1838 types total. An additional advantage of applying a stemming algorithm
is the reduced storage requirement for the final index. Without stemming, the CRANFIELD
index requires 79 MBytes on the hard disk; stemming reduces its size to 53 MBytes.

For the experiments in this thesis, a Porter stemming algorithm was implemented as a
subclass of a MALLET pipe, such that it could be easily integrated as a preprocessing step.


4.1.3 Stopword Removal 
Stopword removal allows the omission of very frequent terms that are shared in almost
all documents with high probability. These words (e.g., “and,” “the,” “a”) do not carry
meaning and are therefore not useful for information retrieval tasks. If they are left in the
document, they consume computation time and add noise to the model. Therefore, it is standard
procedure to remove them in advance. The English stopword list that was used throughout
the experiments contains the 571 most common words in English documents. MALLET
provides a pipe implementation that removes stopwords as part of the preprocessing.

The CRANFIELD corpus consists of 7045 unique words; after removing stopwords,
6639 remain. This was measured without stemming first. For the actual implementation,
however, it is important to stem first and then apply stopword and rare type removal.


4.1.4 Removing Rare Types 


Rare types pose a different issue. Since they show up in the corpus in only one
or two documents, it is practically impossible to compute the correlation with other words.
Additionally, removing rare types reduces the chance of having misspelled words in the
index after preprocessing.

Of the 6639 types that remain in the CRANFIELD corpus after stopword removal, 4112
are used fewer than 5 times throughout the corpus. Removing these leaves an index with 2527
types total. Having a smaller number of types results in a smaller index, which can be stored
in memory directly.
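
The order of the preprocessing steps can be illustrated with a small sketch. It is written in Python and does not use the MALLET Pipe API; the function names, the regular expression, and the identity function used as a default stemmer are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (drops digits and punctuation)."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(raw_docs, stopwords, min_count, stem=lambda token: token):
    """Tokenize, stem, remove stopwords, then remove rare types, in that order.

    stem is a placeholder for a stemming function such as a Porter stemmer.
    Types that occur fewer than min_count times in the whole corpus are removed.
    """
    docs = [[stem(token) for token in tokenize(text)] for text in raw_docs]
    docs = [[t for t in doc if t not in stopwords] for doc in docs]
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]
```

Queries would have to be passed through the same functions with the same stopword list and stemmer, so that query terms and index terms share one vocabulary.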


4.1.5 Building Termvectors 


For the Expectation Propagation inference algorithm, each document and query needs to be
represented as a term vector. A term vector is a vector of length V, the size of the
vocabulary, whose entry for type w is the number of tokens of type w in the document. For
this purpose, MALLET uses an additional pipe, the “FeatureSequence2FeatureVector” class.
For Gibbs sampling, however, a term vector representation is not suitable; a term sequence
has to be used instead, and this preprocessing step must be skipped.
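
The difference between the two representations can be illustrated as follows. This is a
sketch in Python under the assumption of a plain dictionary `vocab_index` that maps each type
to its position; it does not reproduce MALLET's classes, and the function names are
hypothetical.

```python
def term_sequence(tokens, vocab_index):
    """Keep token order: one vocabulary index per token (input for Gibbs sampling)."""
    return [vocab_index[t] for t in tokens if t in vocab_index]


def term_vector(tokens, vocab_index):
    """Discard token order: a count per type, as a vector of length V."""
    vec = [0] * len(vocab_index)
    for i in term_sequence(tokens, vocab_index):
        vec[i] += 1
    return vec


# Hypothetical three-type vocabulary and a short preprocessed document.
vocab_index = {"boundary": 0, "layer": 1, "flow": 2}
doc_tokens = ["flow", "layer", "flow"]
seq = term_sequence(doc_tokens, vocab_index)
vec = term_vector(doc_tokens, vocab_index)
```

The term sequence keeps one entry per token, which a Gibbs sampler needs in order to assign
a topic to every token, whereas the term vector only keeps counts.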


4.2 Building the Index 


For the purposes of this thesis, the index is simply a V x D matrix, in which V is the
number of types and D the number of documents in the corpus. Each column of this matrix
represents a multinomial distribution over types, which means that each document is
represented as a vector on the (V−1)-simplex. In order to use the predictive probability for
determining the relevance of a document to a particular query, this matrix has to be dense;
that is, it cannot have any zero values.


4.2.1 Building the LDA Index


MALLET implements a robust and efficient version of an LDA document model, which uses a
parallel Gibbs sampling algorithm. This allows the use of several Markov chains to estimate
the true topic distribution for each document. For the search engine implementation, the
number of chains is a parameter that can be supplied by the user.

After convergence, an estimate of the document-topic distribution θ and the type-topic
distribution φ is computed using Equation 2.7. Since topics are exchangeable, it is not
possible to average θ or φ across different samples or even different chains. However, the
final probability distribution $p(w|d) = \sum_{t} \phi_{w,t}\,\theta_{t,d}$, the probability
of a word occurring in a particular document under the LDA model, can be averaged.

The implementation estimates p(w|d) for each Markov chain and averages these estimates.
This results in a V x D matrix, which holds the aggregated probabilistic topic model of the
corpus.
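
The reason why p(w|d) can be averaged while the topic distributions cannot is that p(w|d)
does not change when the topic labels are permuted. The following Python sketch illustrates
this with two hypothetical chains that found the same two topics in opposite order. The
layout (phi as a V x T nested list, theta as a T x D nested list) and all numbers are
assumptions made for the example.

```python
def word_given_doc(phi, theta):
    """Return the V x D matrix p(w|d) = sum over t of phi[w][t] * theta[t][d]."""
    V, T, D = len(phi), len(theta), len(theta[0])
    return [[sum(phi[w][t] * theta[t][d] for t in range(T)) for d in range(D)]
            for w in range(V)]


def average_chains(chains):
    """Average p(w|d) over a list of (phi, theta) estimates, one per Markov chain."""
    mats = [word_given_doc(phi, theta) for phi, theta in chains]
    V, D = len(mats[0]), len(mats[0][0])
    return [[sum(m[w][d] for m in mats) / len(mats) for d in range(D)]
            for w in range(V)]


# Chain B holds the same topics as chain A, but with swapped topic labels.
chain_a = ([[0.9, 0.2], [0.1, 0.8]], [[0.7], [0.3]])
chain_b = ([[0.2, 0.9], [0.8, 0.1]], [[0.3], [0.7]])
p_a = word_given_doc(*chain_a)
p_b = word_given_doc(*chain_b)
p_avg = average_chains([chain_a, chain_b])
```

Averaging the two phi matrices directly would mix different topics, whereas both chains
agree on p(w|d), so the average in the sketch is meaningful.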


4.2.2 Building the HDP Index 


In order to estimate the topic distribution under an HDP-based model, the implementation by
Teh et al. (2006) was used. The code runs in Matlab, and Octave was used as an intermediate
step to convert the binary Matlab format into a CSV file, which was then imported by a Java
class.

The type-topic distribution estimate φ is then computed in exactly the same way as in
Equation 2.7. For the topic-document distribution, there is no equivalent to the
hyperparameter α in the LDA model. Therefore, smoothing was omitted. The final distribution,
however, cannot have any zero values, because the type-topic distribution does not contain
any zero values and every row in the topic-document distribution has at least one value
greater than zero.

As with the LDA model, the estimates from several chains are averaged to obtain a more
robust estimate of the probability distribution p(w|d).


4.2.3 Building the Smoothed Language Model 


For the keyword-based search, it is necessary to implement a language model that is based
on word counts only. In order to use predictive probability for ranking documents, this
language model needs to be smoothed. The language model is estimated using Equation 3.3.

Computing this probability distribution can be done in a single pass over the term vectors
and results in a V x D matrix, in which every column represents the probability distribution
over word types for a particular document. It therefore has the same structure as the
probability distributions computed by the probabilistic topic models.

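
As an illustration, the following Python sketch computes a smoothed language model from term
vectors. Equation 3.3 is not reproduced in this section, so the sketch assumes the common
Dirichlet-smoothed form p(w|d) = (c(w,d) + μ p(w|C)) / (|d| + μ), with the corpus-wide
relative frequency p(w|C) as base distribution. The function name and the toy data are
hypothetical.

```python
def smoothed_language_model(term_vectors, mu=700.0):
    """Return one smoothed distribution per document, plus the base distribution.

    Assumed form (Dirichlet smoothing): (c(w,d) + mu * p(w|C)) / (|d| + mu).
    Each inner list of the result corresponds to one column of the V x D matrix.
    """
    V = len(term_vectors[0])
    total = sum(sum(tv) for tv in term_vectors)
    # Base distribution: relative frequency of each type in the whole corpus.
    base = [sum(tv[w] for tv in term_vectors) / total for w in range(V)]
    columns = []
    for tv in term_vectors:
        n = sum(tv)
        columns.append([(tv[w] + mu * base[w]) / (n + mu) for w in range(V)])
    return columns, base


# Hypothetical corpus with two documents and two types.
lm_columns, lm_base = smoothed_language_model([[2, 0], [1, 1]], mu=4.0)
```

In this sketch, the base distribution is accumulated first and the term vectors are then
visited once. As long as every type occurs somewhere in the corpus, no entry is zero, which
is the density property the index requires.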
4.2.4 Building the Final Index 


The final index is then simply the weighted average of the smoothed language model and a
probabilistic topic model:

$$p(w|d) = \lambda\, p_{\mathrm{Dirichlet}}(w|d) + (1 - \lambda)\, p_{\mathrm{LDA/HDP}}(w|d), \tag{4.1}$$

in which 0 ≤ λ ≤ 1 is the weighting parameter, p_Dirichlet denotes the smoothed language
model, and p_LDA/HDP denotes the probabilistic topic model. This is implemented by a simple
weighted matrix addition, which produces a dense matrix of size V x D.

The index also stores the averaged matrices from the probabilistic topic model and the
smoothed language model. This allows changing the values for λ and y later, after the index
has been built, and thus avoids having to retrain the whole model.
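
The weighted combination in Equation 4.1 can be sketched as follows. The Python function
below is a hypothetical illustration that operates on nested lists of size V x D; it assumes
that the weighting parameter multiplies the smoothed language model, which is consistent
with the later experiments in which a weighting parameter of zero denotes a pure topic model.

```python
def mix_index(language_model, topic_model, lam):
    """Return lam * language_model + (1 - lam) * topic_model, entry by entry."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("the weighting parameter must lie in [0, 1]")
    return [[lam * a + (1.0 - lam) * b for a, b in zip(row_lm, row_tm)]
            for row_lm, row_tm in zip(language_model, topic_model)]


# Hypothetical stored component matrices for V = 2 types and D = 1 document.
lm_matrix = [[0.8], [0.2]]
tm_matrix = [[0.4], [0.6]]
mixed = mix_index(lm_matrix, tm_matrix, 0.7)
pure_topic = mix_index(lm_matrix, tm_matrix, 0.0)
```

Because both component matrices are kept, the mixture can be recomputed for a new weighting
without running the Gibbs sampler again.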


4.3 Implementation of the Search and Ranking Algorithm 


4.3.1 Preprocessing the Query 


The query is provided by the user of the search application as a simple string. It needs to
be tokenized, stopwords and rare types need to be removed, and the words need to be stemmed,
before the query is finally turned into a term vector.

For this task, a separate pipe, the “query pipe,” is used in the implementation. It is
important that the query pipe use exactly the same preprocessing steps as are used on
documents. It is also necessary that the query pipe have the same word dictionary (in MALLET
called the alphabet) as the pipe for the documents, because the terms in the term vector
need to have the same indices. That is, if the term “experiment” has the index 5 in the
document corpus, this needs to be true for the query as well. Otherwise, the matrix lookup
will return the wrong result.

However, the original word set must not grow if the query contains a word that is not
contained in the corpus. Therefore, the alphabet must not be supplied by reference to the
original object. In the implementation, the alphabet is cloned first and a reference to the
clone is passed to the query pipe.
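
The effect of cloning can be illustrated with a toy dictionary class that, like a growing
alphabet, assigns a new index to every unseen term on lookup. The class `Alphabet` below is a
hypothetical Python stand-in written for this example; it is not MALLET's class and does not
reproduce its API.

```python
class Alphabet:
    """Toy growing dictionary: looking up an unseen term assigns it the next index."""

    def __init__(self, entries=()):
        self.index = {}
        for entry in entries:
            self.lookup(entry)

    def lookup(self, term):
        return self.index.setdefault(term, len(self.index))

    def clone(self):
        copy = Alphabet()
        copy.index = dict(self.index)
        return copy


def query_term_vector(tokens, corpus_alphabet):
    """Map query tokens to corpus indices without growing the corpus alphabet."""
    size = len(corpus_alphabet.index)
    working = corpus_alphabet.clone()   # only the clone may grow
    counts = {}
    for tok in tokens:
        i = working.lookup(tok)
        if i < size:                    # ignore terms unknown to the corpus
            counts[i] = counts.get(i, 0) + 1
    return counts


corpus_alphabet = Alphabet(["boundary", "layer", "experiment"])
query_counts = query_term_vector(["experiment", "hypersonic", "experiment"],
                                 corpus_alphabet)
```

The unknown term “hypersonic” is added only to the clone, so the indices of the corpus
alphabet and the size of the index matrix remain unchanged.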


4.3.2 Computing the Ranking Score


After computing the combined index as a weighted average of the probabilistic topic model
and the smoothed language model and turning the user query into a term vector, computing the
ranking score is a straightforward application of Equation 3.4. However, multiplying a
sequence of small probabilities can lead to numerical instabilities. Therefore, the natural
logarithm is used instead:

$$\mathrm{score}_Q(D) = \sum_{w \in Q} c(w, Q) \log p(w|D),$$

in which c(w, Q) counts how often the term w appears in the query Q. This count will
typically be one, since words are rarely repeated in a query. If, however, the query is a
full paragraph or a question, terms might be repeated. For the similarity measure, this
means that documents that contain that particular word, or that contain many words that are
correlated with it, will get a higher ranking. Repeating a term in the query thus has a
boosting effect for that particular term.

MALLET’s term vector implementation is very useful for this computation because it is very
memory efficient: it stores only two arrays of integers. One holds the vocabulary indices of
all terms with non-zero counts; the other holds the actual counts.

After computing the scores for all documents, the documents are ordered by score and
returned to the user.
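
The scoring and ordering step can be sketched as follows. The Python code below is an
illustration only; it assumes a dense index stored as a nested list of size V x D and a
sparse query given as a mapping from vocabulary index to count, and all names and numbers
are hypothetical.

```python
import math


def log_score(query_counts, index, d):
    """Sum of c(w, Q) * log p(w|d) over the query terms, for document column d."""
    return sum(c * math.log(index[w][d]) for w, c in query_counts.items())


def rank_documents(query_counts, index):
    """Return document columns ordered from highest to lowest score."""
    n_docs = len(index[0])
    return sorted(range(n_docs),
                  key=lambda d: log_score(query_counts, index, d),
                  reverse=True)


# Hypothetical dense index with V = 3 types (rows) and D = 2 documents (columns).
toy_index = [[0.5, 0.1],
             [0.3, 0.2],
             [0.2, 0.7]]
ranking_a = rank_documents({0: 1, 2: 1}, toy_index)
ranking_b = rank_documents({2: 2}, toy_index)
```

Only the non-zero entries of the query are visited, which is why a sparse query
representation is sufficient. Because the index is dense, the logarithm is always defined.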


4.4 Maintaining the Index 


This section describes how the special cases of adding and deleting documents are handled.
Both cases only allow for marginal changes to the corpus. If many documents are added or
removed over time, it is best to retrain the indexer. In the case of the LDA-based model,
this takes only a few minutes, whereas for the HDP-based model it requires some hours and is
more involved.

4.4.1 Adding Documents 

As with a query, the new document has to be preprocessed. After this step, a topic
distribution is inferred. For the LDA implementation, MALLET’s TopicInferencer is used. For
HDP, the document is treated as a test document versus the rest of the corpus. In both
cases, a Gibbs sampler generates the necessary topic distribution. This new distribution
over topics is then multiplied by the already existing word-topic distribution, which
results in a vector of length V, the size of the vocabulary.

Additionally, the smoothed language model for the document is computed, again using the
known base distribution. Finally, the two vectors obtained are combined according to the
defined weighting scheme, and the result is appended to the index.
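
The steps above can be sketched as follows, starting from an already inferred topic
distribution for the new document. The Python code is a hypothetical illustration: the
inference step itself is not shown, the smoothed language model is assumed to have the
Dirichlet-smoothed form (c(w,d) + μ p(w|C)) / (|d| + μ), and the weighting parameter is
assumed to multiply the language model.

```python
def new_document_column(phi, theta_new, term_vector, base, mu, lam):
    """Build the index column for a new document.

    phi: existing V x T word-topic matrix; theta_new: inferred topic distribution
    (length T); base: existing corpus base distribution (length V).
    """
    V, T = len(phi), len(theta_new)
    topic_part = [sum(phi[w][t] * theta_new[t] for t in range(T)) for w in range(V)]
    n = sum(term_vector)
    lm_part = [(term_vector[w] + mu * base[w]) / (n + mu) for w in range(V)]
    return [lam * lm_part[w] + (1.0 - lam) * topic_part[w] for w in range(V)]


def append_column(index, column):
    """Attach the new column to a V x D index stored as a list of rows."""
    for row, value in zip(index, column):
        row.append(value)


# Hypothetical values: V = 2 types, T = 2 topics, one existing document.
column = new_document_column(phi=[[0.9, 0.2], [0.1, 0.8]], theta_new=[0.7, 0.3],
                             term_vector=[1, 1], base=[0.5, 0.5], mu=2.0, lam=0.5)
small_index = [[0.6], [0.4]]
append_column(small_index, column)
```

Neither the word-topic matrix nor the base distribution is updated here, which is why this
procedure is only suitable for marginal changes to the corpus.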


4.4.2 Removing Documents 


Removing a document from the corpus is implemented in the simplest possible way. First, the
document is removed from the feature sequence list. Then the corresponding columns in the
topic model, language model, and combined index are removed. While this ensures that the
removed document is not accidentally returned by the search application, it leaves the base
distributions of the probabilistic topic model and the smoothed language model untouched.
For a single document, this is not critical, because a single document does not have a large
influence on the base distributions. If, however, many documents or a very large document
are removed, the index should be regenerated from the new corpus.


4.5 Evaluation 


This section describes how benchmark queries, for which the relevant documents are known,
are implemented such that the retrieval performance of the prototype can be computed.
Additionally, it describes which methods allow the computation of performance measures.


4.5.1 Query Representation 


Each query is wrapped in an instance of class “Query.” Every instance holds an internal
identifier and a data set identifier. This is necessary because the benchmark data sets do
not provide a consecutive numbering of the queries. The third field maintained in the class
is a single string value, which represents the actual query. Since the query object can be
used in any search module that fits in the search framework, this string is not preprocessed.

In order to allow performance assessment, each query holds a list of document identifiers,
which contains all the documents that are labeled relevant to the query.


4.5.2 Query Set Representation 


An instance of class “QuerySet” is a container for “Query” instances. The class provides a
static method that reads text files with queries and query-document relevance pairs to
generate the query set. One instance method worth mentioning is “trimToSize().” It removes
irrelevant queries from the query set. A query is called irrelevant if it does not have
relevant documents assigned to it.


4.5.3 Computing Performance Measures 


The main class for computing performance measures is the “Evaluator” class. An instance of
this class is generated with an object of class “ModularSearchEngine,” a “ModuleMixer”
object (Hawkins, 2009), a “QuerySet” instance, and an integer indicating the maximum number
of documents that the user wants to be retrieved.

An “Evaluator” instance provides methods to compute the average precision for each query
(“computeAveragePrecision()”) and the mean average precision over all queries
(“computeMeanAveragePrecision()”). Additionally, it computes a confusion matrix for each of
the queries and at every possible number n, 1 ≤ n ≤ N, in which N is the maximum number of
documents retrieved.


CHAPTER 5:
Evaluation 





5.1 Document Modeling Experiments 


This section describes experiments that were conducted in the document modeling context.
It shows how topics emerge and can be interpreted, and how hyperparameters and the number
of topics can be estimated to obtain a document model with high likelihood.


5.1.1 Topic Detection


Griffiths and Steyvers (2004) provided a Matlab toolbox that uses Gibbs sampling to obtain
samples from the posterior distribution of the latent topics. This toolbox was the basis for
the first experiments on topic detection and parameter estimation. For a smaller corpus with
artificial data, the LDA Gibbs sampler was implemented in R (see Appendix A.1.2) and used
for small-scale experiments, which are not discussed here.

The first experiments were run on a stratified sample of 1,000 documents from the
Wikipedia¹ collection. Table 5.1 shows the 10 most likely words in five sample topics, based
on a single Markov chain of the Gibbs sampler and 200 topics. The smoothing constant was set
to β = 0.01. The documents were preprocessed as described in Section 4.1, with the exception
of stemming. For this experiment, it was important to get topics with readable words, which
is not possible if stemming is applied.


5.1.2 Estimating the Number of Mixture Components 


Griffiths and Steyvers (2004) suggest a hill climbing method to estimate the number of
topics in a given corpus. This requires the hyperparameters α and β to be known and fixed.
The idea is to compute the posterior probability of a corpus given the number of topics,
P(w|T). Unfortunately, this is intractable, since it requires the computation of P(w|T) for
any conceivable topic distribution. The topic distribution itself is a draw from a
continuous Dirichlet distribution, and therefore the number of possible topic distributions
is uncountable. Griffiths and Steyvers suggest running a Gibbs sampler on the model with
different Markov chains and estimating the resulting posterior probability P(w|z) from the
samples of each








¹Wikimedia Foundation Inc. Wikipedia: The Free Encyclopedia.
http://download.wikimedia.org/enwiki/latest/. Online, last accessed 28 April 2008.

Topic 2      Topic 3       Topic 13    Topic 28       Topic 51
ball         treatment     software    god            greek
play         medical       computer    christian      zeus
team         acupuncture   hardware    church         mythology
player       disease       video       jesus          gods
football     pain          disk        christianity   god
line         studies       computers   believe        son
offensive    evidence      memory      book           aeneas
defensive    effects       bit         christ         myth
pass         found         operating   holy           goddess
field        patients      screen      faith          temple

Table 5.1: Sample topics in Wikipedia


chain. These probabilities are then averaged by applying the harmonic mean:

$$P(w|T) = \left( \frac{1}{K} \sum_{k=1}^{K} \frac{1}{P(w|z_k)} \right)^{-1},$$

in which K denotes the number of samples taken. The number of topics that maximizes P(w|T)
is then accepted. This procedure was applied in our research with 8 Markov chains and 10
samples for each chain, giving K = 80, on the Wikipedia corpus with β = 0.01.
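
Since corpus likelihoods are far too small to be represented as floating point numbers, the
harmonic mean has to be evaluated on log-likelihoods in practice. The following Python
sketch shows one way to do this with the log-sum-exp trick; it is an illustration under that
assumption and not the code used for the experiments, and the function name is hypothetical.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Return the log of the harmonic mean of K likelihoods, given their logs.

    Uses log HM = log K - logsumexp(-l_1, ..., -l_K), with the maximum factored
    out so that no exponential can overflow.
    """
    K = len(log_likelihoods)
    negated = [-l for l in log_likelihoods]
    m = max(negated)
    lse = m + math.log(sum(math.exp(x - m) for x in negated))
    return math.log(K) - lse


# Small check: the harmonic mean of 0.5 and 0.25 is 1/3.
small = log_harmonic_mean([math.log(0.5), math.log(0.25)])
# Log-likelihoods of realistic magnitude cannot be exponentiated directly.
large = log_harmonic_mean([-10000.0, -10010.0])
```

The second call shows that the result stays finite for values whose direct exponentiation
would underflow to zero.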

An alternative approach averages the natural logarithm of the likelihood and has better
numerical stability. Figure 5.1 shows how this method works on the CRANFIELD data set, based
on 10 samples per number of topics. The concavity of the likelihood function becomes even
more apparent when more samples are taken per topic value. The gap between two design points
should be at least 10, because there is almost no difference in likelihood between two
models whose numbers of topics differ by only one or two. For all practical purposes, it is
sufficient to determine the number of topics rounded to the closest multiple of ten.


5.1.3 Number of Iterations for the Gibbs Sampling Algorithm


Determining the optimal number of iterations for burn-in and the lag between samples for a
Gibbs sampler is not trivial and is still an open field of research. Furthermore, it is hard
to determine whether the Gibbs sampler has converged to the target distribution (Brooks and
Roberts, 1998). The major problem is that a Gibbs sampler may spend many iterations in a
local optimum before it finally converges to the right solution. Since it is a stochastic
algorithm, this number of iterations cannot be predicted precisely. Bounds on the number of
iterations exist for a few special cases, which do not include LDA- or HDP-based

Figure 5.1: Optimal number of topics for the CRANFIELD data set. The solid line represents
the likelihood function, while the dashed lines represent the bounds of a 95% confidence
interval. For every number of topics, 10 independent Gibbs sampling runs were used. The
optimal number of topics is determined as 40, given a smoothing parameter of β = 0.007.


models. Raftery and Lewis (1992) suggest that this number should be less than 5,000. In the
LDA and HDP literature, typical numbers are between 1,000 and 2,000.

Figure 5.2 shows the typical behavior of the corpus likelihood P(D) during a run of a single
Markov chain. At about 150 iterations, a local maximum can be observed, from which the
likelihood first drops before it climbs to a fairly stable value after 1,200 iterations.
Figure 5.2 was produced by an HDP model based on the CRANFIELD data set. The suggestion is
to use at least 1,000 iterations as burn-in time and then to take at least 20 samples with a
lag of at least 100 iterations. These settings regularly produced good results in the
experiments, whether LDA-based or HDP-based.

Figure 5.2: Typical likelihood behavior of a corpus D for a Gibbs sampler in the HDP setting


5.2 Information Retrieval Experiments 


Table 5.2 shows how a probabilistic topic model can improve retrieval results. As an
example, query 153 from the CRANFIELD benchmark corpus is used, and for each method the top
ten documents were considered. The table contains only relevant documents. Document 1083
does not share keywords with the query and therefore will not be returned by any method that
is based on keyword search alone. A probabilistic topic model alone, however, would return
too many documents that are not relevant to the query. In Table 5.2, the fourth column shows
how a combination of both methods, keyword search and a probabilistic topic model, can
improve retrieval performance. Though this is just a single example, it shows how a
probabilistic topic model and a keyword-based search complement each other.

Query   pure LDA   pure VSM   mix with λ = 0.8
153     1078       1081       1078
        1082       1082       1081
        1083       1085       1082
        1085                  1083
                              1085

Table 5.2: Comparison of pure LDA, pure VSM, and a combination of both with λ = 0.8, based
on query 153 of the CRANFIELD benchmark corpus.


5.2.1 Evaluation Metric 


The mean average precision (Robertson, 2008) is used as the metric to compare the retrieval
performance of different models and/or different parameter sets. The average precision for a
single query is defined as

$$AP = \frac{1}{R} \sum_{n=1}^{D} AP_n,$$

in which R is the total number of relevant documents and D denotes the total number of
documents in the corpus. The contribution of document d_n, the document returned at rank n,
to the average precision, AP_n, is defined as

$$AP_n = \frac{1}{n} \sum_{m=1}^{n} \delta_{n,m},$$

in which δ_{n,m} = 1 if the documents d_n and d_m are both relevant to the query and
δ_{n,m} = 0 otherwise. The mean average precision is then the mean of the average precision
values over all queries.

Mean average precision was chosen because it is the metric used in related research on the
same benchmark data. A Java implementation of the algorithm is presented in Appendix B.3.1.
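
For illustration, the following Python sketch computes average precision and mean average
precision in their standard form: a document contributes the precision at its rank if it is
relevant and nothing otherwise. It is not the Java implementation from the appendix; the
function names and the document identifiers in the example are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against the relevant set."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("a query without relevant documents cannot be evaluated")
    hits = 0
    total = 0.0
    for n, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / n        # precision at rank n
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical result lists for two queries sharing the same relevant set.
ap_first = average_precision([1078, 9, 1081, 7], {1078, 1081})
ap_second = average_precision([9, 1078], {1078, 1081})
map_value = mean_average_precision([([1078, 9, 1081, 7], {1078, 1081}),
                                    ([9, 1078], {1078, 1081})])
```

Relevant documents that are never returned contribute zero, so truncating the result list at
a maximum number of documents can only lower the value. Queries without relevant documents
have to be removed beforehand, because the measure is undefined for them.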


5.2.2 Baselines 


For each benchmark corpus, a baseline was defined. This baseline comes from published
information retrieval papers (Roussinov and Fan, 2006) or, in the case of the TIME MAGAZINE
data set, from the best experiment that uses only keyword search and that applied stemming,
stopword removal, and rare type removal before training the index. Table 5.3 shows the
corpus statistics and the baselines used.

Corpus          Number of    Number of          Mean average
                documents    relevant queries   precision baseline
CISI            1460         76                 0.200
CRANFIELD       1398         225                0.392
MEDLINE         1033         30                 0.518
TIME MAGAZINE   423          83                 0.526

Table 5.3: Corpus statistics and baselines for the information retrieval experiments.


5.2.3 Effects of Removing Rare Types 


Removing types that occur infrequently hurts the average precision in the retrieval task.
Table 5.4 shows the differences in retrieval performance with and without rare type removal.
While the retrieval performance with rare types is slightly higher, the storage requirements
for the index grow drastically.

An analysis of the aggregated values for mean average precision leads to overlapping
confidence intervals and therefore shows no statistical significance. If the analysis is
conducted on the difference in average precision for each query, the result changes. The
fourth column in Table 5.4 shows the p-values computed by a Wilcoxon signed rank test based
on the differences in average precision per query. The alternative hypothesis is that the
average precision is greater with rare types in the corpus than without. A Wilcoxon signed
rank test conducted on all queries regardless of the corpus results in a p-value of 0.005,
which shows significance even at the one percent level. Of course, this result cannot be
extrapolated to an unseen corpus.

The values in Table 5.4 were generated with smoothing factor μ = 700, 10 Markov chains, 345
topics, 800 iterations for the Gibbs sampler, and 8 parallel threads for the topic model
estimation. The prior parameter for LDA was fixed at β = 0.01, and the mixing proportion was
λ = 0.7.
































Corpus          with rare types   without rare types   p-value   storage difference
CISI            22.80%            22.05%               <0.01     71 MBytes
CRANFIELD       44.02%            42.30%               0.03867   42 MBytes
MEDLINE         59.28%            59.91%               0.5803    108 MBytes
TIME MAGAZINE   59.28%            54.82%               0.037     90 MBytes

Table 5.4: Effects of removing rare types applied to the different corpora. The metric is
mean average precision. The p-value is computed per corpus based on direct comparison of the
query results.


5.2.4 Influence of Stemming


The influence of stemming was studied on an LDA-based model mixed with the smoothed language
model at a weighting coefficient of λ = 0.7. The prior parameter on the type-topic
distribution was β = 0.01, and the smoothing constant was μ = 700. The model was trained
with 345 topics and 10 Markov chains. Table 5.5 shows the results in mean average precision
and the storage reduction for the index after applying stemming. The analysis is the same as
described in Section 5.2.3.

For CISI, CRANFIELD, and MEDLINE, the test shows a significant performance increase at the
10 percent level, but not at the five percent level. For TIME MAGAZINE, the test does not
show significance. Applied to the set of all queries, the Wilcoxon signed rank test returns
a p-value of 0.002, which shows significance at the one percent level. Again, it is likely
that this result does not apply to an unseen corpus.

















Corpus          with stemming   without stemming   p-value   storage reduction
CISI            22.05%          20.03%             <0.01     17 MBytes
CRANFIELD       43.29%          42.30%             0.075     26 MBytes
MEDLINE         59.91%          58.14%             0.057     10 MBytes
TIME MAGAZINE   54.82%          52.53%             0.222     12 MBytes

Table 5.5: Effects of stemming applied to the different corpora. The metric is mean average
precision.


5.2.5 Influence of the Number of Markov Chains 


The number of independent Gibbs sampler runs heavily influences the quality of the
probabilistic topic model, whether it is LDA- or HDP-based. Since every run results in a
point estimate generated by a stochastic process, it makes sense to obtain several
independent estimates and combine them into a more robust estimate of the model
distributions.

For all corpora, there is a significant increase in retrieval performance if the index
combines the estimates from several Markov chains rather than a single chain (see Table
5.6). All models were trained with β = 0.006, a value that consistently led to good results
on all corpora, and 700 topics. The smoothed language model was not used for this test,
which means that the weighting parameter in Equation 4.1 was set to λ = 0.

Thus, in general, the more Markov chains are evaluated, the closer the estimate approaches
the correct distribution, which improves the retrieval performance. This comes at the cost
of computational effort. In our experiments, 10 Markov chains was always a reasonable
number, which also agrees with related research (Wei and Croft, 2006).

Corpus          1 chain   3 chains   10 chains   p-value (1 vs. 10 chains)
CISI            14.47%    16.93%     18.25%      <0.01
CRANFIELD       34.84%    37.97%     41.19%      <0.01
MEDLINE         51.02%    51.20%     54.97%      0.021
TIME MAGAZINE   48.10%    49.75%     53.99%      <0.01

Table 5.6: Effects of combining several Markov chains. The metric is mean average precision
on a pure LDA model. The p-value is the comparison between results from one and ten chains.


5.2.6 Influence of the Number of Topics for LDA 


As Griffiths and Steyvers (2004) show, the number of mixture components or topics has a
significant influence on the perplexity of a document model. It is therefore natural to
assume that the same holds for the task of information retrieval. Too small a number of
topics would result in a few very general topics, and the model would become a smoothed
language model, whereas too large a number of topics leads to a mixture of unigrams model
that does not share topics among documents. Thus, the assumption is that there is a specific
number of topics that maximizes the retrieval performance.

In the experiments, this assumption could not be rejected. Related research (Azzopardi et
al., 2003), however, shows that the best document model for a corpus is not necessarily the
best information retrieval model. This can be seen directly by comparing the plot in Figure
5.1 with Figure 5.3. The document model for the CRANFIELD data set maximizes its likelihood
at 40 topics, whereas the information retrieval model performs best at 1050 topics.

















Corpus          number of topics   mean average precision
CISI            700                17.48%
CRANFIELD       1050               42.43%
MEDLINE         500                56.32%
TIME MAGAZINE   900                58.85%

Table 5.7: Optimal number of topics per corpus and achieved mean average precision. The
smoothing parameter for all corpora is β = 0.007.


Table 5.7 shows the optimal number of topics for the four benchmark corpora according to
the best retrieval result obtained with this setting. This number was determined by a greedy
heuristic on a restricted domain (300 to 1000 topics) submitted to a high performance
cluster, which allowed several hundred parameter combinations to be evaluated at once. All
models were trained with 10 independent Gibbs sampling estimates.

Figure 5.3 shows the typical behavior of the mean average precision as a function of the
number of topics. The figure is based on the CRANFIELD data set with β = 0.007, 10 Markov
chains, and λ = 0. Based on the plot, the hypothesis that the function is concave cannot be
rejected. In all experiments, hill climbing over the number of topics gave good results in
retrieval performance.





Figure 5.3: The mean average precision (vertical axis) versus the number of topics
(horizontal axis) on the CRANFIELD data set, based on 10 samples for each design point.
Other parameters were β = 0.007, λ = 0, and 10 independent Gibbs sampling runs. The
concavity of the function cannot be rejected from this plot. The optimal number of topics
for the CRANFIELD data set is determined as 1050.


5.2.7 Influence of the Weighting between the Topic Model and the 
Language Model 


Figure 5.4 shows how the weighting between the probabilistic topic model and the smoothed
language model influences retrieval performance. Recall is the proportion of relevant
documents that were returned by the application. Precision is the ratio of the number of
relevant documents returned to the total number of documents returned. Ideally, a
precision-recall plot starts almost horizontally at recall zero and precision close to one
and stays that way until it drops to precision zero at recall one. In practice, this is
rarely the case, considering the random nature of documents and queries.

Figure 5.4: Precision and recall for different values of λ on the CRANFIELD data set.


Clearly, the choice of λ influences the retrieval performance. Values between λ = 0.7 and
λ = 0.8 worked best for all corpora in all experiments. The plot in Figure 5.4 also shows
how the keyword-based search is augmented by the probabilistic topic model: each search
method on its own performs worse than their combination.


5.2.8 Difference between LDA and HDP 


Table 5.8 shows the results of a direct comparison of an LDA- and an HDP-based model with
weighting parameter λ = 0. For both models, a symmetric Dirichlet distribution with
smoothing constant β = 0.007 was used for the prior base distribution. The preprocessing for
both models and all corpora was the same. The p-values in Table 5.8 were computed by a
Wilcoxon signed rank test based on paired observations for each query. The second and third
columns show the respective mean average precision.

The results show a significant improvement for HDP over LDA on the CISI and MEDLINE data
sets. For CRANFIELD and TIME MAGAZINE, there is not enough evidence to favor the alternative
hypothesis that HDP performs better than LDA. If all queries are compared pairwise, ignoring
the corpus factor, the resulting p-value is 0.076, which leads to rejection of the null
hypothesis at the 10 percent level of significance, but not at the five percent level.

















Corpus          pure LDA   pure HDP   p-value
CISI            18.25%     19.62%     0.025
CRANFIELD       41.19%     41.47%     0.307
MEDLINE         54.97%     59.10%     0.016
TIME MAGAZINE   53.99%     53.20%     0.591

Table 5.8: Comparison of a pure LDA-based model versus a pure HDP-based model with smoothing
constant β = 0.007. The p-value is based on a Wilcoxon signed rank test with the alternative
hypothesis that LDA performs worse than HDP.


5.2.9 Improvements over the Baseline 


Finally, it is interesting to determine whether the prototype leads to improvements over the
baseline. The results are presented in Table 5.9. For all corpora, improvements in
information retrieval were achieved based on mean average precision. These improvements
range from about three percentage points for CISI to more than 10 percentage points for
MEDLINE.

For the baseline values, it is not possible to state the statistical significance based on a
per-query comparison, because the per-query results are not available in the literature. The
only statistical tests that are valid in this case are a sign test and a Wilcoxon signed
rank test based on the aggregated values. Both tests result in a p-value of 1/16 = 0.0625.
Here, the null hypothesis is that a probabilistic topic model does not lead to significantly
different retrieval performance, while the alternative hypothesis is that a probabilistic
topic model improves retrieval performance. With a p-value of 0.06, the null hypothesis can
be rejected in favor of the alternative hypothesis at the 10 percent level of significance,
but not at the five percent level.
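
The reported p-value can be reproduced with a short calculation. With four corpora that all
improve, a one-sided sign test gives (1/2)^4 = 1/16. The Python sketch below is a
hypothetical illustration; as input it uses the differences between the LDA-based results
and the baselines from Table 5.9.

```python
from math import comb


def sign_test_p_value(differences):
    """One-sided sign test: probability of at least as many positive signs
    under the null hypothesis that both signs are equally likely."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    k = sum(1 for d in nonzero if d > 0)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


# Improvements of the LDA-based model over the baseline, in percentage points.
improvements = [23.28 - 20.0, 44.55 - 39.2, 61.57 - 51.8, 58.76 - 52.6]
p_value = sign_test_p_value(improvements)
```

With only four paired observations, 1/16 is the smallest p-value that a one-sided sign test
can produce, so a rejection at the five percent level is impossible regardless of the size
of the improvements.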





























Corpus          Baseline   LDA-based   HDP-based
CISI            20.0%      23.28%      23.10%
CRANFIELD       39.2%      44.55%      45.41%
MEDLINE         51.8%      61.57%      62.34%
TIME MAGAZINE   52.6%      58.76%      55.88%

Table 5.9: Improvements over the baseline for all corpora. Shown is the result of the
respective best parameter configuration.


CHAPTER 6:
Conclusions and Recommendations 





This chapter discusses the experimental results from Chapter 5 and possible implications for
the SHARE corpus. Further, it gives recommendations for future research on information
retrieval based on probabilistic topic models.


6.1 Discussion of Experimental Results 


Chapter 5 shows that probabilistic topic models can lead to significant improvements
in information retrieval. A probabilistic topic model, however, cannot successfully serve as
the only model in an information retrieval system. In general, its feature space is too coarse
to discriminate effectively between documents given a user query. It therefore has to be
combined with the smoothed language model, which is entirely keyword-based and can be trained
independently of the probabilistic topic model.

It is desirable to have an efficient implementation for topic detection based on HDP,
ideally as part of the MALLET (McCallum, 2002) package. The current detour of using
Octave, Matlab, and Java to produce a trained document model for information retrieval is
not efficient enough for a standalone prototype.


6.1.1 Evaluation for SHARE 


All experiments were done on abstracts from special domains. These are comparable 
with descriptions in SHARE’s card library, which will be available for search to the user. 
These experiments, however, cannot replace experiments on the target corpus. 

Once a sufficient number of documents from the SHARE corpus is available, it is
therefore recommended to produce benchmark queries to adjust parameter settings. In addition,
a final implementation of the search engine should collect user feedback and submitted
queries to grow the number of benchmark queries. This allows the adjustment of model
parameters during the whole life cycle of the application.


6.1.2 Preprocessing 


Stemming leads to a significant improvement in retrieval performance and reduces storage
requirements. Therefore, it is recommended that stemming be applied to the documents in
SHARE as well.

Removing rare types was found to be potentially disadvantageous for information
retrieval. Although the effect is statistically significant for three of four corpora, the
practical significance needs to be assessed separately. Removing rare types has, however, the
advantage of reducing storage and computation time, and should therefore be considered.
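The removal of rare types can be sketched as follows. This is an illustration only: the threshold `min_count`, the use of corpus-wide token counts rather than document frequencies, and the function name are assumptions, not the settings used in the experiments.

```python
from collections import Counter


def remove_rare_types(documents, min_count=2):
    """Drop every token whose type occurs fewer than min_count times in the corpus.

    documents is a list of token lists; the result has the same shape.
    """
    counts = Counter(token for doc in documents for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in documents]


docs = [["topic", "model", "retrieval"], ["topic", "model"], ["query"]]
print(remove_rare_types(docs))
```

A document can become empty after filtering, as the last document in the example does, so a real pipeline would have to decide how to handle such documents.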


6.1.3 Parameter Settings 


For the prior parameter setting, β = 0.007 is recommended. This value performed
well for all sample corpora. The burn-in time for LDA should be 1,000 iterations or more; a
lag of 100 iterations between samples from each chain and at least 10 samples are necessary
for a robust point estimate from a single chain. Ten independent Gibbs sampler runs give
consistently good results in information retrieval. The number of topics is hard to predict
for an unseen corpus. A number between 500 and 900 can be expected to perform well, but
user feedback should be used to adjust this number. In general, the number of topics can be
expected to become larger as the number of word tokens in the corpus grows. Typically that
means that the number of documents has to grow in this case as well. There is, however, no way
of estimating this number in advance or as a function of the corpus size. For the HDP-based
model, the number of topics does not need to be specified as it is computed by the inference
algorithm.
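The sampling schedule and the combination of chains can be sketched as follows. The sketch assumes that the first sample is taken one lag after the burn-in and that the estimates combined across chains are document models p(w|d), as in the Matlab listing in Appendix B.2; averaging p(w|d) avoids having to align topic indices between chains. Function names are illustrative.

```python
def sample_iterations(burn_in=1000, lag=100, num_samples=10):
    """Iterations at which a single chain is read out (assumed: first sample one lag after burn-in)."""
    return [burn_in + lag * k for k in range(1, num_samples + 1)]


def average_matrices(matrices):
    """Element-wise mean of equally shaped matrices given as lists of rows."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) / n for c in range(cols)] for r in range(rows)]


schedule = sample_iterations()
print(schedule[0], schedule[-1], len(schedule))

# Two toy p(w|d) estimates for one word type and two documents.
combined = average_matrices([[[0.2, 0.8]], [[0.4, 0.6]]])
print(combined)
```

Under these assumptions a single chain runs for 2,000 iterations, and the ten chains are independent, so they can be run in parallel.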

For the keyword-based model, a smoothing constant of μ = 1100 showed the best results
on all corpora except the TIME MAGAZINE corpus. For TIME MAGAZINE, μ = 1000
performed slightly better.

In all cases, a weighting of λ = 0.7 between the keyword-based and the topic-based
model performed best. This result holds regardless of whether the topic model is LDA- or HDP-based.
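How the two constants interact can be sketched as follows. The sketch assumes that the keyword-based model is a Dirichlet-smoothed unigram language model with smoothing constant μ and that λ is the weight of the keyword-based model in a linear interpolation; both are assumptions made for illustration, and all function and parameter names are hypothetical.

```python
import math


def dirichlet_smoothed(tf, doc_len, p_collection, mu=1100.0):
    """Keyword-based estimate of p(w|d) with Dirichlet smoothing (assumed form)."""
    return (tf + mu * p_collection) / (doc_len + mu)


def interpolate(p_keyword, p_topic, lam=0.7):
    """Linear interpolation; lam is assumed to weight the keyword-based model."""
    return lam * p_keyword + (1.0 - lam) * p_topic


def query_log_likelihood(query_terms, doc_tf, doc_len, p_collection, p_topic,
                         mu=1100.0, lam=0.7):
    """Log-likelihood of a query under the combined document model."""
    score = 0.0
    for term in query_terms:
        p_kw = dirichlet_smoothed(doc_tf.get(term, 0), doc_len, p_collection[term], mu)
        score += math.log(interpolate(p_kw, p_topic[term], lam))
    return score


p_coll = {"topic": 0.01}
p_top = {"topic": 0.05}
with_term = query_log_likelihood(["topic"], {"topic": 3}, 100, p_coll, p_top)
without_term = query_log_likelihood(["topic"], {}, 100, p_coll, p_top)
print(with_term > without_term)
```

Documents would then be ranked by this score. A document that does not contain a query term still receives a nonzero probability from both the collection model and the topic model, which is the effect the topic model is meant to contribute.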


6.2 Hidden Markov Models for Topic Detection 


So far, all research presented in this thesis has rested on the assumption that
observations are exchangeable (see Section 2.0.1). Though the evaluated models performed
significantly better in the information retrieval task than the baseline, the question
remains whether further improvement is possible by also taking word order into account,
which is not random in reality.

A refinement of LDA in that direction was presented by Griffiths et al. (2005). The
authors defined a model in which each word is tagged not only by a topic index,
but also by a syntactic class index (e.g., noun, verb, adjective). The document-generating
process therefore gets an additional step: after determining the topic for word i, its syntactic
state is defined and then the word is drawn from a multinomial distribution that is conditioned
not only on the topic, but also on the syntactic state. The sequence of these syntax labels is
determined by a Hidden Markov Model (HMM), defining an order on the words in a document.
The authors show that document models based on that model result in a higher likelihood for
a corpus than plain LDA. The question is now whether a document model that incorporates
syntactic structure can improve information retrieval performance.

Hidden Topic Markov Models (HTMM), introduced by Gruber et al. (2007), represent 
another attempt to relax LDA’s modeling assumptions. Instead of seeing a document as a bag 
of words, the authors treat it as a bag of sentences. Topic changes are allowed only at the 
beginning of a sentence and all words inside the sentence share the same topic. Assuming 
that words inside the same sentence are generated by the same topic is very intuitive. The 
experimental results presented in Gruber et al. (2007) show a higher corpus likelihood than 
LDA. 
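The sentence-level topic assignment described above can be illustrated with a simplified generative sketch. It assumes a document-specific topic distribution `theta` and a single probability `epsilon` of drawing a new topic at the start of a sentence; the first sentence always draws a topic. The names and this parameterization are assumptions for illustration and do not reproduce the full model of Gruber et al. (2007).

```python
import random


def generate_sentence_topics(sentence_lengths, theta, epsilon, rng):
    """Assign one topic per sentence; all words in a sentence share that topic.

    A new topic is drawn from theta at the first sentence and, with
    probability epsilon, at the start of each later sentence.
    """
    topics = []
    current = None
    for n_words in sentence_lengths:
        if current is None or rng.random() < epsilon:
            current = rng.choices(range(len(theta)), weights=theta, k=1)[0]
        topics.append([current] * n_words)
    return topics


rng = random.Random(0)
assignment = generate_sentence_topics([4, 6, 3], [0.5, 0.3, 0.2], 0.5, rng)
print(assignment)
```

Setting `epsilon` to zero forces the whole document onto one topic, while setting it to one redraws the topic for every sentence.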

Another possible direction for improving topic models based on HMM is derived from 
HDP. Using HDP, it is possible to define an HMM in a nonparametric way. That is, the state 
space grows with the number of observations as the number of mixture components does 
in the HDP model presented in Section 2.2. A topic model based on such an HMM would 
combine the advantages of a nonparametric model and a model that does not rely on a bag of 
words assumption. The HDP-HMM is derived in Teh et al. (2006) and an inference algorithm 
is described. 

All three presented refinements should be considered in future research, because they
can possibly lead to document models that improve information retrieval performance. A
query, however, needs its own model in a word-order-based information retrieval system. This
follows because a user who submits a query to the application is typically not concerned
about the ordering of the keywords, nor does he provide a complete sentence from which
syntactic states can be inferred.


6.3 Empirical Priors for HDP


So far, all refinements of the document models considered changes of the model. As 
noted in Section 2.2, HDP can have any probability measure as the base measure. Rather than 
using a symmetric Dirichlet distribution as base measure, the distribution should be learned 
from the data directly. 

As a step in this direction, a simple experiment was conducted using an asymmetric
Dirichlet distribution as prior. For the distribution parameter, the smoothed language model
was used (see Chapter 4). The experimental results showed an improvement in retrieval
performance over the symmetric distribution. These results are not presented here, because an
asymmetric base measure does not necessarily lead to an exchangeable process (Hansen and
Pitman, 1998) and the inference scheme in Section 2.2.2 is not guaranteed to converge to the
right distribution.

McAuliffe et al. (2006) present an algorithm to compute an empirical base measure
using a Gibbs sampling scheme combined with kernel methods. In their example, the authors
use a kernel based on the normal distribution. For document modeling, the kernel should be
based on the Dirichlet distribution or a Polya Tree (Ghosh and Ramamoorthi, 2003). Kernels
based on the Dirichlet distribution have been used by Hinneburg et al. (2007) and Draeger et al. (2009).
APPENDIX A:
Inference and Learning Algorithms





A.1 Inference for LDA 


A.1.1 Expectation Propagation algorithm in R 





exp_propagation_LDA <- function(document, alpha, pwa, numIter=100) {
  # document: vector of counts, one entry per word type
  # alpha:    Dirichlet prior parameters, one entry per topic
  # pwa:      word probabilities given topic (word types x topics)

  # Initialize variables
  W <- length(document)
  T <- length(alpha)
  gamma_ <- alpha
  beta_ <- matrix(rep(0, W*T), nrow=W)
  beta_new <- beta_
  s <- rep(1, W)
  s_new <- s

  # Repeat for numIter sweeps (no explicit convergence check)
  for (i in 1:numIter) {
    # Loop through all words
    for (w in 1:W) {
      # Start with deletion
      gamma_w <- gamma_ - beta_[w, ]

      # if one of the gamma_w's is negative, skip this word
      if (sum(gamma_w < 0) == 0 && document[w] > 0) {
        # Moment matching; sum() keeps the dot product a plain scalar
        dp_pwa_gamma <- sum(pwa[w, ] * gamma_w)
        sum_gamma <- sum(gamma_w)
        zw <- dp_pwa_gamma / sum_gamma
        prefactor <- 1/zw * gamma_w / sum_gamma
        m <- prefactor * (pwa[w, ] + dp_pwa_gamma) /
          (1 + sum_gamma)
        m2 <- prefactor * (gamma_w + 1) / (1 + sum_gamma) *
          (2 * pwa[w, ] + dp_pwa_gamma) /
          (2 + sum_gamma)
        gamma_prime <- (m - m2) / (m2 - m^2) * m

        # update
        # Define the step size
        mu <- 1 / document[w]

        # update variables tentatively
        beta_new[w, ] <- mu * (gamma_prime - gamma_w) +
          (1 - mu) * beta_[w, ]
        s_new[w] <- zw * gamma(sum(gamma_prime)) /
          prod(gamma(gamma_prime)) *
          prod(gamma(gamma_w)) /
          gamma(sum_gamma)

        # inclusion; accept the update only if gamma stays non-negative
        gamma_new <- gamma_ + document[w] *
          (beta_new[w, ] - beta_[w, ])
        if (sum(gamma_new < 0) == 0) {
          gamma_ <- gamma_new
          beta_[w, ] <- beta_new[w, ]
          s[w] <- s_new[w]
        }
      }
    }
  }
  return(list(gamma=gamma_, beta=beta_, s=s))
}
A.1.2 Gibbs Sampler for LDA in R


function(WS, DS, T, NN, ALPHA, BETA, Z=NA) {
  #
  # WS is a vector of observations
  # DS has the same length as WS and specifies the document,
  #    observation i came from
  # T is the number of topics
  # NN is the number of iterations
  # ALPHA is the prior parameter for the topic distribution
  # BETA is the prior parameter for the word distribution
  #    given a topic
  #
  # e1071 provides useful functions for discrete distributions
  require(e1071)
  #
  # Create the return values
  #
  WP <- matrix(0, nrow=max(WS), ncol=T)
  DP <- matrix(0, nrow=max(DS), ncol=T)
  ztot <- matrix(0, nrow=T, ncol=1)
  #
  # Create local and temp variables
  #
  topic <- 0
  wbeta <- max(WS) * BETA
  #
  # Initialize the states
  #
  if (all(is.na(Z))) {
    Z <- matrix(0, nrow=length(WS), ncol=1)
    for (i in 1:length(WS)) {
      wi <- WS[i]
      di <- DS[i]
      topic <- rdiscrete(1, rep(1/T, each=T), 1:T)
      Z[i] <- topic
      WP[wi, topic] <- WP[wi, topic] + 1
      DP[di, topic] <- DP[di, topic] + 1
      ztot[topic] <- ztot[topic] + 1
    }
  }
  else { # Start from previously saved state
    for (i in 1:length(WS)) {
      wi <- WS[i]
      di <- DS[i]
      topic <- Z[i]
      WP[wi, topic] <- WP[wi, topic] + 1
      DP[di, topic] <- DP[di, topic] + 1
      ztot[topic] <- ztot[topic] + 1
    }
  }
  #
  # Finally, start sampling
  #
  for (iter in 1:NN) {
    # permute the order of observations
    order <- sample(1:length(WS))
    for (ii in 1:length(WS)) {
      i <- order[ii]
      wi <- WS[i]
      di <- DS[i]
      # remove the current assignment from the counts
      topic <- Z[i]
      ztot[topic] <- ztot[topic] - 1
      WP[wi, topic] <- WP[wi, topic] - 1
      DP[di, topic] <- DP[di, topic] - 1
      # Compute probabilities p(w_i|z)
      probs <- (WP[wi, ] + BETA) / (ztot + wbeta) * (DP[di, ] + ALPHA)
      # Sample from this discrete distribution
      topic <- rdiscrete(1, probs, 1:T)
      # Update the topic counts
      Z[i] <- topic
      WP[wi, topic] <- WP[wi, topic] + 1
      DP[di, topic] <- DP[di, topic] + 1
      ztot[topic] <- ztot[topic] + 1
    }
  }
  return(list(WP=WP, DP=DP, Z=Z, ZTOT=ztot))
}
A.2 Inference and learning for HDP


A.2.1 Gibbs Sampling for HDP


function(hdp, alpha0, gamma, lambda, numiter, vocab)
{
  # parameters
  # hdp is a list of numeric vectors with word indices
  # alpha0 is the concentration parameter for each DP
  # gamma is the concentration parameter for the HDP
  # lambda is the parameter of the symmetric Dirichlet base measure
  # numiter is the number of iterations
  # vocab is the vocabulary
  #
  # rdirichlet() and sample_tables() are not defined in this listing
  # and must be available in the calling environment

  # create some local and intermediate variables
  V <- length(vocab)
  # use internal representation as Griffiths & Steyvers (2006)
  WS <- numeric()
  DS <- numeric()
  for (i in 1:length(hdp)) {
    WS <- c(WS, hdp[[i]])
    DS <- c(DS, rep(i, length(hdp[[i]])))
  }
  K <- 1 # number of assigned clusters
  beta <- rdirichlet(1, c(length(hdp), gamma))
  # create the output matrices (these have to grow ...)
  # wt vocab x topic, dt document x topic
  wt <- matrix(0, V, 1) # one class at first
  dt <- matrix(0, length(hdp), 1)
  # Initialize the relevant vectors
  zvec <- rep(1, length(WS)) # First try, assign all to one
  # mvec holds the number of tables for each
  # restaurant serving dish k
  mvec <- matrix(rep(0, length(hdp)), length(hdp), K)

  for (l in 1:numiter) {
    if (l %% 10 == 0)
      print(paste("iteration", l, sep=": "))
    # slot is a temp variable that iterates
    # over all observations
    slot <- 1
    # now the Gibbs updates
    for (j in 1:length(hdp)) {
      for (i in 1:length(hdp[[j]])) {
        pzji <- numeric(K+1)
        # Take out the current observation
        wt[WS[slot], zvec[slot]] <-
          max(0, wt[WS[slot], zvec[slot]] - 1)
        dt[j, zvec[slot]] <- max(0, dt[j, zvec[slot]] - 1)
        # always sample for one more topic
        for (k in 1:(K+1)) {
          if (k < K+1) {
            # number of customers in restaurant j
            # having dish k
            njk <- dt[j, k]
            pzji[k] <- (njk + alpha0 * beta[k]) *
              (wt[WS[slot], k] + lambda) /
              (sum(wt[, k]) + V * lambda)
          }
          else {
            # if k=k_new
            pzji[k] <- alpha0 * beta[k] / V
          }
        }
        zvec[slot] <- sample(x=1:(K+1), size=1, prob=pzji)
        # here we have to increase the number of classes
        if (zvec[slot] > K) {
          K <- K+1
          wt <- cbind(wt, rep(0, V))
          dt <- cbind(dt, rep(0, length(hdp)))
          mvec <- cbind(mvec, rep(0, length(hdp)))
        }
        # Now, put everything back in order
        wt[WS[slot], zvec[slot]] <-
          wt[WS[slot], zvec[slot]] + 1
        dt[DS[slot], zvec[slot]] <-
          dt[DS[slot], zvec[slot]] + 1
        # sample the number of tables
        for (k in 1:K)
          mvec[j, k] <- sample_tables(alpha0, beta[k], dt[j, k])
        # sample beta
        beta <- rdirichlet(1, c(apply(mvec, 2, sum), gamma))
        # Delete empty classes
        wtGzero <- apply(wt, 2, sum) > 0
        wt <- matrix(wt[, wtGzero], V)
        dt <- matrix(dt[, wtGzero], length(hdp))
        mvec <- matrix(mvec[, wtGzero], length(hdp))
        # keep beta aligned with the remaining classes
        beta <- beta[c(wtGzero, TRUE)]
        K <- sum(wtGzero)
        # fix the z vector
        for (k in 1:length(wtGzero)) {
          # this was an empty class before
          if (!wtGzero[k])
            zvec[zvec > k] <- zvec[zvec > k] - 1
        }
        slot <- slot + 1
      }
    }
  }
  rownames(wt) <- vocab
  return(list(WT=wt, DT=dt, Z=zvec))
}





A.2.2 Auxiliary Variable Sampling 








function(prior, numTables, numDraws, numIter) {
  # a and b are the shape and rate parameters of the gamma prior
  a <- prior[1]
  b <- prior[2]
  # J is the number of restaurants
  J <- length(numDraws)
  # Initialize auxiliary variables
  s <- sample(c(0, 1), size=J, replace=TRUE)
  w <- runif(J)
  # Declare the concentration parameter
  alpha0 <- 0
  for (i in 1:numIter) {
    # First, sample alpha0; b - sum(log(w)) is the rate of the
    # conditional gamma distribution, so it is passed as rate
    alpha0 <- rgamma(1, a + numTables - sum(s),
                     rate=b - sum(log(w)))
    # then resample s
    p <- numDraws / alpha0 / (numDraws / alpha0 + 1)
    s <- apply(as.matrix(p), 1, function(x) rbinom(1, 1, x))
    # Finally, resample w
    w <- apply(as.matrix(numDraws), 1,
               function(x) rbeta(1, alpha0 + 1, x))
  }
  return(list(alpha0=alpha0, s=s, w=w))
}
APPENDIX B:
Implementation Examples





B.1 A MEX Function to Compute the PMF for the Number 
of Mixture Components in a CRP 


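#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include "mex.h"

/* On return, p[k-1] is the probability that n draws from a CRP with
   concentration parameter g occupy exactly k mixture components.
   p must point to n doubles initialized to zero. */
void mixpmf(double *p, int n, double g) {
    int i, j;
    double *out = p, *pp, *tmp;
    if (n < 1) return;
    pp = mxCalloc(n, sizeof(double));
    p[0] = 1.0;                 /* one draw always gives one component */
    for (i = 2; i <= n; i++) {
        /* draw i joins an existing component with probability
           (i-1)/(i-1+g) and opens a new one with probability g/(i-1+g) */
        pp[0] = p[0] * (i - 1) / (i - 1 + g);
        for (j = 1; j < i; j++)
            pp[j] = (p[j] * (i - 1) + p[j - 1] * g) / (i - 1 + g);
        tmp = p;
        p = pp;
        pp = tmp;
    }
    /* the result is in p; copy it to the output buffer if necessary */
    if (p != out) {
        memcpy(out, p, n * sizeof(double));
        mxFree(p);
    } else {
        mxFree(pp);
    }
}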








void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[]) {
    double g, *p;
    int n;
    n = (int) (mxGetScalar(prhs[0]));
    g = (double) (mxGetScalar(prhs[1]));
    plhs[0] = mxCreateDoubleMatrix(1, n, mxREAL);
    p = mxGetPr(plhs[0]);
    mixpmf(p, n, g);
}
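For checking the output of the MEX function, the PMF can also be computed with a few lines of Python. This is a minimal sketch, assuming the standard CRP recurrence in which draw i joins an existing component with probability (i-1)/(i-1+g) and opens a new one with probability g/(i-1+g); the function name is illustrative.

```python
def crp_num_components_pmf(n, g):
    """Return a list p where p[k-1] = P(K = k) for n draws from a CRP with concentration g."""
    p = [0.0] * n
    p[0] = 1.0  # one draw always occupies exactly one component
    for i in range(2, n + 1):
        denom = i - 1 + g
        # update in place from high k to low k so p[k-1] is still the old value
        for k in range(i - 1, 0, -1):
            p[k] = (p[k] * (i - 1) + p[k - 1] * g) / denom
        p[0] = p[0] * (i - 1) / denom
    return p


pmf = crp_num_components_pmf(3, 1.0)
print(pmf)
```

Updating the array from the highest index downwards removes the need for a second buffer, at the cost of being slightly less obvious than the two-buffer version.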


B.2 Matlab Code to Train HDP Model


% create the prior distribution parameters
pML = zeros(vocabSize,1);
numTokens = 0;
numDoc = size(restaurants,2);
for (i=1:numDoc)
    numTokens = numTokens + size(restaurants{i},2);
end
for (i=1:numDoc)
    for (j=1:size(restaurants{i},2))
        pML(restaurants{i}(j)) = pML(restaurants{i}(j)) + 1;
    end
end
hh = pML/numTokens*200;

% create the matrix for the sum of the models
pTotal = zeros(vocabSize,size(restaurants,2));

% create the prior parameters on the hyperparameters
% (\gamma ~ Gamma(1,0.1)
% and \alpha_0 ~ Gamma(1,1))
alphaa = [ 1 1];
alphab = [ .1 1];

% train the different Markov chains
for (c=1:10)
    % train the model
    [hdp, sample, lik, predlik] = hdp2Multinomial_run( ...
        hh, alphaa, alphab, 200, ...
        restaurants, restaurants, 1000, ...
20'5  WO0%. Ae dup Vy hy ds 


    % obtain the topics as smoothed
    % distributions over words
    p_w_z = hdp.base.classqq(:,1:hdp.base.numclass);
    for (i = 1:hdp.base.numclass)
        p_w_z(:,i) = (p_w_z(:,i)+hh) ...
            ./(sum(p_w_z(:,i))+sum(hh));
    end

    % obtain the documents as distributions over topics
    p_z_d = zeros(hdp.base.numclass,size(restaurants,2));
    for (i = 2:(size(restaurants,2)+1))
        p_z_d(:,i-1) = hdp.dp{i}.classnd(1:hdp.base.numclass);
    end

    % normalize column wise
    p_z_d = p_z_d*diag(1./sum(p_z_d,1));

    % generate the final model
    p_w_d = p_w_z*p_z_d;
    fprintf('RUN %d finished\n',c)
    pTotal = pTotal + p_w_d;
end

% compute the average
p_w_d = pTotal/10;
B.3 Java Methods


B.3.1 Average Precision Computation 











/**
 * @return the average precision of each query for the
 *         retrieval task based on the
 *         benchmark corpus.
 */
public double[] computeAveragePrecision() {
    int numQ = queries.size();
    double[] aP = new double[numQ];
    for (int i = 0; i < numQ; i++) {
        double pN = 0;
        // each Query object "knows" its
        // relevant documents
        ArrayList<Integer> relDocs =
            queries.get(i).getRelevantDocs();
        int numRel = relDocs.size();
        int numFound = 0;
        // qRes is an ordered list of documents
        SearchResults qRes = results.get(i);
        int lastRelRank = 1;
        int r = 1;
        double sumPn = 0;
        for (DocScore d : qRes) {
            if (relDocs.contains(d.id())) {
                numFound++;
                // precision at rank r, updated from the precision
                // at the previous relevant rank
                pN = (pN * lastRelRank + 1) / r;
                sumPn += pN;
                lastRelRank = r;
            }
            // if all relevant documents are found, stop
            if (numFound == numRel)
                break;
            r++;
        }
        aP[i] = sumPn / numRel;
    }
    // return the average precision of each query
    return aP;
}
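The same measure can be written compactly in Python, which is convenient for checking the Java results on small examples. This is a minimal sketch: the function names and the plain-list inputs are illustrative and do not mirror the Query and SearchResults classes of the prototype.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list given the set of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    found = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            found += 1
            total += found / rank  # precision at this relevant rank
            if found == len(relevant):
                break
    return total / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of the per-query average precision values."""
    values = [average_precision(r, rel) for r, rel in zip(rankings, relevance)]
    return sum(values) / len(values)


print(average_precision([3, 1, 7, 2], [1, 2]))
```

In the example, the relevant documents appear at ranks 2 and 4, so the precision values 1/2 and 2/4 are averaged. Relevant documents that are never retrieved contribute zero because the sum is divided by the total number of relevant documents.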


B.3.2 Computation of Mean Average Precision 

















public double computeMeanAveragePrecision() {
    double[] aP = computeAveragePrecision();
    double mAP = 0;
    for (int i = 0; i < aP.length; i++)
        mAP += aP[i];
    return mAP / aP.length;
}